DOMAIN: Industrial safety. NLP-based chatbot.
CONTEXT:
The database comes from one of the biggest industries in Brazil, and in the world. There is an urgent need for industries and companies around the globe to understand why employees still suffer injuries and accidents in plants; some of these incidents are fatal.
DATA DESCRIPTION:
The database consists of accident records from 12 plants in three different countries; each row in the data represents one accident occurrence.
Column descriptions:
Link to download the dataset: https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database
PROJECT OBJECTIVE:
Design an ML/DL-based chatbot utility that helps safety professionals highlight the safety risk implied by an incident description.
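The objective above amounts to supervised text classification: predict a severity label from the free-text incident description. A minimal sketch with hypothetical toy data (scikit-learn assumed available; the actual project would use the full 'Description' column and the 'Accident Level' labels with a proper train/test split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy descriptions and Accident Level labels, for illustration only.
descriptions = [
    "worker's hand caught between drill bar and beam",
    "employee slipped on wet floor while cleaning",
    "pulley fell and struck worker's foot during maintenance",
    "minor cut on finger while handling cathode sheet",
]
levels = ["IV", "I", "IV", "I"]

# TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(descriptions, levels)
print(model.predict(["hand injured by falling equipment"])[0])
```

This is only a baseline shape for the utility; the later modelling work may swap in deep-learning components.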
PROJECT TASK:
# Import necessary python libraries and ignore unnecessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_colwidth', None)
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/gdrive')
file_path = '/content/gdrive/My Drive/Capstone_Group10_NLP1/Dataset_Industrial_Safety_and_Health_Database_with_Accidents_description.xlsx'
# Read the Excel file using pandas
ISH_df = pd.read_excel(file_path)
# Display the first few rows of the dataframe
ISH_df.head()
Mounted at /content/gdrive
| Unnamed: 0 | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee or Third Party | Critical Risk | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. |
| 1 | 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. |
| 2 | 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. |
| 3 | 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. |
| 4 | 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. |
Shape of Input Dataframe:
print("Number of rows = {0} and Number of Columns = {1} in the Data frame".format(ISH_df.shape[0], ISH_df.shape[1]))
Number of rows = 425 and Number of Columns = 11 in the Data frame
Datatype of each attribute:
# Check datatypes
ISH_df.dtypes
| 0 | |
|---|---|
| Unnamed: 0 | int64 |
| Data | datetime64[ns] |
| Countries | object |
| Local | object |
| Industry Sector | object |
| Accident Level | object |
| Potential Accident Level | object |
| Genre | object |
| Employee or Third Party | object |
| Critical Risk | object |
| Description | object |
This output shows that most of the columns are of type 'object', which typically means they contain string data.
The 'Data' column is of type 'datetime64[ns]', and 'Unnamed: 0' is of type 'int64'.
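Since 'Accident Level' and 'Potential Accident Level' hold Roman-numeral severity grades, they are arguably better represented as ordered categoricals than as plain strings, so comparisons and sorting respect severity order. A small sketch on a toy frame (not the notebook's actual conversion step):

```python
import pandas as pd

# Toy frame standing in for the severity columns.
df = pd.DataFrame({"Accident Level": ["I", "IV", "II", "I"]})
order = ["I", "II", "III", "IV", "V", "VI"]

# Ordered categorical: 'I' < 'II' < ... < 'VI'.
df["Accident Level"] = pd.Categorical(
    df["Accident Level"], categories=order, ordered=True
)
print(df["Accident Level"].max())  # IV
```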
# Check Dataframe info
ISH_df.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 425 entries, 0 to 424
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Unnamed: 0                425 non-null    int64
 1   Data                      425 non-null    datetime64[ns]
 2   Countries                 425 non-null    object
 3   Local                     425 non-null    object
 4   Industry Sector           425 non-null    object
 5   Accident Level            425 non-null    object
 6   Potential Accident Level  425 non-null    object
 7   Genre                     425 non-null    object
 8   Employee or Third Party   425 non-null    object
 9   Critical Risk             425 non-null    object
 10  Description               425 non-null    object
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 36.6+ KB
# Missing value count
ISH_df.isnull().sum()
| 0 | |
|---|---|
| Unnamed: 0 | 0 |
| Data | 0 |
| Countries | 0 |
| Local | 0 |
| Industry Sector | 0 |
| Accident Level | 0 |
| Potential Accident Level | 0 |
| Genre | 0 |
| Employee or Third Party | 0 |
| Critical Risk | 0 |
| Description | 0 |
# Dropping Unnecessary Columns:
ISH_df.drop("Unnamed: 0", axis=1, inplace=True)
'Unnamed: 0' appears to be a leftover index column and does not provide any useful information for the analysis, so it is dropped.
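An equivalent alternative is to consume the stray index at read time: both read_excel and read_csv accept index_col=0, so the "Unnamed: 0" column is never created in the first place. Illustrated with an in-memory CSV:

```python
import io
import pandas as pd

# Toy CSV mimicking the file's leading unnamed index column.
csv = io.StringIO("Unnamed: 0,Data\n0,2016-01-01\n1,2016-01-02\n")
df = pd.read_csv(csv, index_col=0)  # first column becomes the index
print(list(df.columns))  # ['Data']
```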
ISH_df.head()
| Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee or Third Party | Critical Risk | Description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. |
| 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. |
| 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. |
# Renaming the columns as per available Data and Description
ISH_df.rename(columns={
"Data": "Date",
"Countries": "Country",
"Local": "City",
"Genre": "Gender",
"Employee or Third Party":"Employee Type",
}, inplace=True)
# Modify 'City' column values
ISH_df['City'] = ISH_df['City'].str.replace('Local_', 'City_')
ISH_df.head()
| Date | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. |
| 1 | 2016-01-02 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. |
| 2 | 2016-01-06 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. |
| 3 | 2016-01-08 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. |
| 4 | 2016-01-10 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. |
# Check for Duplicate rows in the dataset
Duplicate_Rows = ISH_df.duplicated().sum()
print('Number of duplicate rows:', Duplicate_Rows)
Number of duplicate rows: 7
# View Duplicate records
Duplicates = ISH_df.duplicated()
ISH_df[Duplicates]
| Date | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 2016-04-01 | Country_01 | City_01 | Mining | I | V | Male | Third Party (Remote) | Others | In circumstances that two workers of the Abratech company were doing putty work inside the conditioning tank (5 meters deep and covered by platforms) of metal gratings - grating- in the upper part), two other employees of the HyT company carried out maneuvers transfer of a pump with the help of a manual tick - which worked hooked to a beam H, dragging the pump on the metal gratings (grating), suddenly the pump is hooked with a metal grate (grating) and when trying to release it, the metal grid (grating - 13.0 Kg. (60 cm x 92 cm)) falls inside the tank, hits a diagonal channel inside the tank and then impacts the right arm of one of the workers and rubs the helmet of the second worker that he was crouching. The area where the bomb was being moved was marked with tape and did not have a lookout. |
| 262 | 2016-12-01 | Country_01 | City_03 | Mining | I | IV | Male | Employee | Others | During the activity of chuteo of ore in hopper OP5; the operator of the locomotive parks his equipment under the hopper to fill the first car, it is at this moment that when it was blowing out to release the load, a mud flow suddenly appears with the presence of rock fragments; the personnel that was in the direction of the flow was covered with mud. |
| 303 | 2017-01-21 | Country_02 | City_02 | Mining | I | I | Male | Third Party (Remote) | Others | Employees engaged in the removal of material from the excavation of the well 2 of level 265, using shovel and placing it in the bucket. During the day some of this material fell into the pipes of the employees' boots and the friction between the boot and the calf caused a superficial injury to the legs. |
| 345 | 2017-03-02 | Country_03 | City_10 | Others | I | I | Male | Third Party | Venomous Animals | On 02/03/17 during the soil sampling in the region of Sta. the employees Rafael and Danillo da Silva were attacked by a bee test. They rushed away from the place, but the employee Rafael took 4 bites, one on the chin, one on the chest, one on the neck and one on the hand over the glove. The employee took 4 bites, one in his hand over his glove and the other in the head, and the employee Danillo took 2 bites in the left arm over his uniform. At first no one sketched allergy, just swelling at the sting site. The activity was stopped to evaluate the site, after verifying that the test had remained in the line, they left the site. |
| 346 | 2017-03-02 | Country_03 | City_10 | Others | I | I | Male | Third Party | Venomous Animals | On 02/03/17 during the soil sampling in the region of Sta. the employees Rafael and Danillo da Silva were attacked by a bee test. They rushed away from the place, but the employee Rafael took 4 bites, one on the chin, one on the chest, one on the neck and one on the hand over the glove. The employee took 4 bites, one in his hand over his glove and the other in the head, and the employee Danillo took 2 bites in the left arm over his uniform. At first no one sketched allergy, just swelling at the sting site. The activity was stopped to evaluate the site, after verifying that the test had remained in the line, they left the site. |
| 355 | 2017-03-15 | Country_03 | City_10 | Others | I | I | Male | Third Party | Venomous Animals | Team of the VMS Project performed soil collection on the Xixás target with 3 members. When the teams were moving from one collection point to another, Mr. Fabio was ahead of the team, stinging behind Robson and Manoel da Silva. near the collection point were surprised by a swarm of bees that was inside a I play near the ground, with no visibility in the woods and no hissing noise. Fabio passed by the stump, but Robson and Manoel da Silva were attacked by the bees. Robson had a sting in his left arm over his uniform and Manoel da Silva had a prick in his lip as his screen ripped as he tangled in the branches during the escape. |
| 397 | 2017-05-23 | Country_01 | City_04 | Mining | I | IV | Male | Third Party | Projection of fragments | In moments when the 02 collaborators carried out the inspection of the conveyor belt No. 3 from the tail pulley when they were at the height of the load polymer No. 372, the Maslucan collaborator heard a noise where note that the belt was moving towards the tail pulley, 4 "fragmentos mineral fragments are projected towards the access of the ramp impacting the 2 collaborators, being evacuated to the medical post. |
# Remove duplicate rows and save the deduplicated dataset
ISH_df_cleaned = ISH_df.drop_duplicates()
# Save the deduplicated dataset to a new file
ISH_df_cleaned.to_csv('ISH_df_cleaned.csv', index=False)
# Print the number of rows before and after deduplication
print('Number of rows before deduplication:', len(ISH_df))
print('Number of rows after deduplication:', len(ISH_df_cleaned))
Number of rows before deduplication: 425
Number of rows after deduplication: 418
# Shape of Deduplicated Dataframe 'ISH_df_cleaned'
ISH_df_cleaned.shape
print("Number of rows = {0} and Number of Columns = {1} in the Data frame after removing the duplicates.".format(ISH_df_cleaned.shape[0], ISH_df_cleaned.shape[1]))
Number of rows = 418 and Number of Columns = 10 in the Data frame after removing the duplicates.
# Check unique values for each column in the deduplicated dataframe
Unique_Values = ISH_df_cleaned.nunique()
Unique_Values
| 0 | |
|---|---|
| Date | 287 |
| Country | 3 |
| City | 12 |
| Industry Sector | 3 |
| Accident Level | 5 |
| Potential Accident Level | 6 |
| Gender | 2 |
| Employee Type | 3 |
| Critical Risk | 33 |
| Description | 411 |
# Check Cleaned Dataframe info
ISH_df_cleaned.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Index: 418 entries, 0 to 424
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Date                      418 non-null    datetime64[ns]
 1   Country                   418 non-null    object
 2   City                      418 non-null    object
 3   Industry Sector           418 non-null    object
 4   Accident Level            418 non-null    object
 5   Potential Accident Level  418 non-null    object
 6   Gender                    418 non-null    object
 7   Employee Type             418 non-null    object
 8   Critical Risk             418 non-null    object
 9   Description               418 non-null    object
dtypes: datetime64[ns](1), object(9)
memory usage: 35.9+ KB
# Identify numerical and categorical columns
numerical_columns = ISH_df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = ISH_df_cleaned.select_dtypes(exclude=[np.number]).columns.tolist()
# Exclude the 'Date' column from the categorical columns
categorical_columns = [col for col in categorical_columns if col != 'Date']
print('Numerical columns:', numerical_columns)
print('Categorical columns:', categorical_columns)
Numerical columns: [] Categorical columns: ['Country', 'City', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']
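With no numeric columns left after dropping the index, simple numeric features can still be derived from the text itself, such as description length, which may be useful in EDA. A hypothetical sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset.
df = pd.DataFrame({"Description": [
    "worker injured by falling rock",
    "minor cut on finger",
]})
# Word count per accident description.
df["Word Count"] = df["Description"].str.split().str.len()
print(df["Word Count"].tolist())  # [5, 4]
```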
# Check unique values in the 'Data' column
Unique_Dates = ISH_df_cleaned['Date'].unique()
Unique_Dates
<DatetimeArray> ['2016-01-01 00:00:00', '2016-01-02 00:00:00', '2016-01-06 00:00:00', '2016-01-08 00:00:00', '2016-01-10 00:00:00', '2016-01-12 00:00:00', '2016-01-16 00:00:00', '2016-01-17 00:00:00', '2016-01-19 00:00:00', '2016-01-26 00:00:00', ... '2017-06-24 00:00:00', '2017-06-20 00:00:00', '2017-06-23 00:00:00', '2017-06-19 00:00:00', '2017-06-22 00:00:00', '2017-06-29 00:00:00', '2017-07-04 00:00:00', '2017-07-05 00:00:00', '2017-07-06 00:00:00', '2017-07-09 00:00:00'] Length: 287, dtype: datetime64[ns]
# Viewing result in the form of separate dataframes for each attribute
for column in ISH_df_cleaned.columns:
print(f'\nDataFrame for {column}:')
if column in categorical_columns:
df_temp = pd.DataFrame(ISH_df_cleaned[column].value_counts()).reset_index()
df_temp.columns = [column, 'Count']
# Calculate percentage
total = df_temp['Count'].sum()
df_temp['Percentage'] = (df_temp['Count'] / total * 100).round(2) # Round off to 2 decimal points
else:
df_temp = pd.DataFrame(ISH_df_cleaned[column].describe()).reset_index()
df_temp.columns = ['Statistic', column]
display(df_temp)
print('-' * 50)
DataFrame for Date:
| Statistic | Date | |
|---|---|---|
| 0 | count | 418 |
| 1 | mean | 2016-09-18 20:50:31.578947328 |
| 2 | min | 2016-01-01 00:00:00 |
| 3 | 25% | 2016-04-30 06:00:00 |
| 4 | 50% | 2016-09-06 00:00:00 |
| 5 | 75% | 2017-02-06 12:00:00 |
| 6 | max | 2017-07-09 00:00:00 |
--------------------------------------------------
DataFrame for Country:
| Country | Count | Percentage | |
|---|---|---|---|
| 0 | Country_01 | 248 | 59.33 |
| 1 | Country_02 | 129 | 30.86 |
| 2 | Country_03 | 41 | 9.81 |
--------------------------------------------------
DataFrame for City:
| City | Count | Percentage | |
|---|---|---|---|
| 0 | City_03 | 89 | 21.29 |
| 1 | City_05 | 59 | 14.11 |
| 2 | City_01 | 56 | 13.40 |
| 3 | City_04 | 55 | 13.16 |
| 4 | City_06 | 46 | 11.00 |
| 5 | City_10 | 41 | 9.81 |
| 6 | City_08 | 27 | 6.46 |
| 7 | City_02 | 23 | 5.50 |
| 8 | City_07 | 14 | 3.35 |
| 9 | City_12 | 4 | 0.96 |
| 10 | City_09 | 2 | 0.48 |
| 11 | City_11 | 2 | 0.48 |
--------------------------------------------------
DataFrame for Industry Sector:
| Industry Sector | Count | Percentage | |
|---|---|---|---|
| 0 | Mining | 237 | 56.70 |
| 1 | Metals | 134 | 32.06 |
| 2 | Others | 47 | 11.24 |
--------------------------------------------------
DataFrame for Accident Level:
| Accident Level | Count | Percentage | |
|---|---|---|---|
| 0 | I | 309 | 73.92 |
| 1 | II | 40 | 9.57 |
| 2 | III | 31 | 7.42 |
| 3 | IV | 30 | 7.18 |
| 4 | V | 8 | 1.91 |
--------------------------------------------------
DataFrame for Potential Accident Level:
| Potential Accident Level | Count | Percentage | |
|---|---|---|---|
| 0 | IV | 141 | 33.73 |
| 1 | III | 106 | 25.36 |
| 2 | II | 95 | 22.73 |
| 3 | I | 45 | 10.77 |
| 4 | V | 30 | 7.18 |
| 5 | VI | 1 | 0.24 |
--------------------------------------------------
DataFrame for Gender:
| Gender | Count | Percentage | |
|---|---|---|---|
| 0 | Male | 396 | 94.74 |
| 1 | Female | 22 | 5.26 |
--------------------------------------------------
DataFrame for Employee Type:
| Employee Type | Count | Percentage | |
|---|---|---|---|
| 0 | Third Party | 185 | 44.26 |
| 1 | Employee | 178 | 42.58 |
| 2 | Third Party (Remote) | 55 | 13.16 |
--------------------------------------------------
DataFrame for Critical Risk:
| Critical Risk | Count | Percentage | |
|---|---|---|---|
| 0 | Others | 229 | 54.78 |
| 1 | Pressed | 24 | 5.74 |
| 2 | Manual Tools | 20 | 4.78 |
| 3 | Chemical substances | 17 | 4.07 |
| 4 | Cut | 14 | 3.35 |
| 5 | Projection | 13 | 3.11 |
| 6 | Venomous Animals | 13 | 3.11 |
| 7 | Bees | 10 | 2.39 |
| 8 | Fall | 9 | 2.15 |
| 9 | Vehicles and Mobile Equipment | 8 | 1.91 |
| 10 | remains of choco | 7 | 1.67 |
| 11 | Fall prevention (same level) | 7 | 1.67 |
| 12 | Pressurized Systems | 7 | 1.67 |
| 13 | Fall prevention | 6 | 1.44 |
| 14 | Suspended Loads | 6 | 1.44 |
| 15 | Liquid Metal | 3 | 0.72 |
| 16 | Pressurized Systems / Chemical Substances | 3 | 0.72 |
| 17 | Power lock | 3 | 0.72 |
| 18 | Blocking and isolation of energies | 3 | 0.72 |
| 19 | Electrical Shock | 2 | 0.48 |
| 20 | Machine Protection | 2 | 0.48 |
| 21 | Poll | 1 | 0.24 |
| 22 | Confined space | 1 | 0.24 |
| 23 | Electrical installation | 1 | 0.24 |
| 24 | Not applicable | 1 | 0.24 |
| 25 | Plates | 1 | 0.24 |
| 26 | Projection/Burning | 1 | 0.24 |
| 27 | Traffic | 1 | 0.24 |
| 28 | Projection/Choco | 1 | 0.24 |
| 29 | Burn | 1 | 0.24 |
| 30 | Projection/Manual Tools | 1 | 0.24 |
| 31 | Individual protection equipment | 1 | 0.24 |
| 32 | Projection of fragments | 1 | 0.24 |
--------------------------------------------------
DataFrame for Description:
| Description | Count | Percentage | |
|---|---|---|---|
| 0 | During the activity of chuteo of ore in hopper OP5; the operator of the locomotive parks his equipment under the hopper to fill the first car, it is at this moment that when it was blowing out to release the load, a mud flow suddenly appears with the presence of rock fragments; the personnel that was in the direction of the flow was covered with mud. | 2 | 0.48 |
| 1 | The employees Márcio and Sérgio performed the pump pipe clearing activity FZ1.031.4 and during the removal of the suction spool flange bolts, there was projection of pulp over them causing injuries. | 2 | 0.48 |
| 2 | In the geological reconnaissance activity, in the farm of Mr. Lázaro, the team composed by Felipe and Divino de Morais, in normal activity encountered a ciliary forest, as they needed to enter the forest to verify a rock outcrop which was the front, the Divine realized the opening of the access with machete. At that moment, took a bite from his neck. There were no more attacks, no allergic reaction, and continued work normally. With the work completed, leaving the forest for the same access, the Divine assistant was attacked by a snake and suffered a sting in the forehead. At that moment they moved away from the area. It was verified that there was no type of allergic reaction and returned with normal activities. | 2 | 0.48 |
| 3 | At moments when the MAPERU truck of plate F1T 878, returned from the city of Pasco to the Unit transporting a consultant, being 350 meters from the main gate his lane is invaded by a civilian vehicle, making the driver turn sharply to the side right where was staff of the company IMPROMEC doing hot melt work in an 8 "pipe impacting two collaborators causing the injuries described At the time of the accident the truck was traveling at 37km / h - according to INTHINC -, the width of the road is of 6 meters, the activity had safety cones as a warning on both sides of the road and employees used their respective EPP'S. | 2 | 0.48 |
| 4 | When starting the activity of removing a coil of electric cables in the warehouse with the help of forklift truck the operator did not notice that there was a beehive in it. Due to the movement of the coil the bees were excited. Realizing the fact the operator turned off the equipment and left the area. People passing by were stung. | 2 | 0.48 |
| ... | ... | ... | ... |
| 406 | Being 01:50 p.m. approximately, in the Nv. 1800, in the Tecnomin winery. Mr. Chagua - Bodeguero was alone, cutting wires No. 16 with a grinder, previously he had removed the protection guard from the disk of 4 inches in diameter and adapted a disk of a crosscutter of approximately 8 inches. Originating traumatic amputation of two fingers of the left hand | 1 | 0.24 |
| 407 | In circumstances that the collaborator performed the cleaning of the ditch 3570, 0.50 cm deep, removing the pipe of 2 "HDPE material with an estimated weight of 30 Kg. Together with two collaborators, when pushing the tube to drain the dune, the collaborator is hit on the lower right side lip producing a slight blow to the lip. At the time of the event, the collaborator had a safety helmet, glasses and gloves. | 1 | 0.24 |
| 408 | During the process of washing the material (Becker), the tip of the material was broken which caused a cut of the 5th finger of the right hand | 1 | 0.24 |
| 409 | The clerk was peeling and pulling a sheet came another one that struck in his 5th chirodactile of the left hand tearing his PVC sleeve caused a cut. | 1 | 0.24 |
| 410 | Once the mooring of the faneles in the detonating cord has been completed, the injured person proceeds to tie the detonating cord in the safety guide (slow wick) at a distance of 2.0 meters from the top of the work. At that moment, to finish mooring, a rock bank (30cm x 50cm x 15cm; 67.5 Kg.) the same front, from a height of 1.60 meters, which falls to the floor very close to the injured, disintegrates in several fragments, one of which (12cmx10cmx3cm, 2.0 Kg.) slides between the fragments of rock and impacts with the left leg of the victim. At the time of the accident the operator used his safety boots and was accompanied by a supervisor. | 1 | 0.24 |
411 rows × 3 columns
--------------------------------------------------
Overall:
- The cleaned dataset has 418 records across 10 columns, spanning 2016-01-01 to 2017-07-09, with no missing values.
Specific Observations:
- 'Accident Level' is highly imbalanced: level I alone accounts for 73.92% of records, while level V accounts for only 1.91%.
- 'Potential Accident Level' skews more severe than 'Accident Level' (level IV alone is 33.73%), suggesting many incidents could have been worse than they were.
- 94.74% of the injured are male; Mining contributes 56.70% of records; and 'Critical Risk' is dominated by the uninformative 'Others' category (54.78%).
Potential Areas for Further Analysis:
- The class imbalance in 'Accident Level' will need to be addressed (e.g. resampling or class weights) before modelling.
- Because 'Critical Risk' is mostly 'Others', the free-text 'Description' column is the most informative input for the chatbot.
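Since the free-text 'Description' column carries most of the signal for the chatbot, it will need standard NLP cleaning before vectorization. A minimal regex-only sketch of such a cleaning step (stop-word removal and lemmatization, e.g. via NLTK or spaCy, would typically follow; this helper is illustrative, not the notebook's final preprocessing):

```python
import re

def clean_description(text: str) -> str:
    """Lowercase, drop digits/punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip()

print(clean_description("Being 9:45 am. approximately, the personnel begins..."))
# being am approximately the personnel begins
```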
ISH_df_cleaned
| Date | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. |
| 1 | 2016-01-02 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. |
| 2 | 2016-01-06 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. |
| 3 | 2016-01-08 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. |
| 4 | 2016-01-10 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | 2017-07-04 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | Being approximately 5:00 a.m. approximately, when lifting the Kelly HQ towards the pulley of the frame to align it, the assistant Marco that is in the later one is struck the hand against the frame generating the injury. |
| 421 | 2017-07-04 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | The collaborator moved from the infrastructure office (Julio to the toilets, when the pin of the right shoe is hooked on the bra of the left shoe causing not to take the step and fall untimely, causing injury described. |
| 422 | 2017-07-05 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | During the environmental monitoring activity in the area, the employee was surprised by a swarming swarm of weevils. During the exit of the place, endured suffering two stings, being one in the face and the other in the middle finger of the left hand. |
| 423 | 2017-07-06 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | The Employee performed the activity of stripping cathodes, when pulling the cathode sheet his hand hit the side of another cathode, causing a blunt cut on his 2nd finger of the left hand. |
| 424 | 2017-07-09 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | At 10:00 a.m., when the assistant cleaned the floor of module "E" in the central camp, she slipped back and immediately grabbed the laundry table to avoid falling to the floor; suffering the described injury. |
418 rows × 10 columns
from google.colab import drive
drive.mount('/content/drive')
ISH_df_cleaned.to_csv('/content/drive/My Drive/Capstone_Group10_NLP1/ISH_df_cleaned.csv', index=False)
Mounted at /content/drive
# @title Potential Accident Level Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('Potential Accident Level').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.show()
# @title Accident Level Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('Accident Level').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.show()
# @title Industry Sector Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('Industry Sector').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.xlabel('Count')
plt.show()
# @title Country Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('Country').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.xlabel('Count')
plt.show()
# @title City Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('City').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.xlabel('Count')
plt.show()
# @title Critical Risk Distribution
# Calculate counts and percentages
counts = ISH_df_cleaned.groupby('Critical Risk').size().sort_values(ascending=True)
total = counts.sum()
percentages = (counts / total * 100).round(2)
# Create bar plot
plt.figure(figsize=(10, 10)) # Adjust figure size as needed
ax = counts.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right']].set_visible(False)
# Add count and percentage labels to bars
for i, (count, percentage) in enumerate(zip(counts, percentages)):
ax.text(count + 5, i, f'{count} ({percentage}%)', va='center')
plt.xlabel('Count')
plt.title('Critical Risk Distribution')
plt.show()
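The five distribution cells above repeat the same count-plot-label pattern. It can be folded into one helper; this is a sketch (the function name `plot_category_distribution` is our own, not part of the original notebook), returning the computed counts and percentages so they can be inspected as well as plotted.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_category_distribution(df, column, figsize=(8, 5)):
    """Horizontal bar chart of value counts for one column,
    labelled with counts and percentages.

    Returns (counts, percentages) so callers can inspect the numbers.
    """
    counts = df.groupby(column).size().sort_values(ascending=True)
    percentages = (counts / counts.sum() * 100).round(2)

    plt.figure(figsize=figsize)
    ax = counts.plot(kind='barh', color=sns.color_palette('Dark2'))
    ax.spines[['top', 'right']].set_visible(False)

    # Annotate each bar with its count and percentage share
    for i, (count, pct) in enumerate(zip(counts, percentages)):
        ax.text(count + 5, i, f'{count} ({pct}%)', va='center')

    plt.xlabel('Count')
    plt.title(f'{column} Distribution')
    return counts, percentages
```

Each of the cells above then reduces to a single call such as `plot_category_distribution(ISH_df_cleaned, 'Critical Risk')`.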
# @title Accident Level and Potential Accident Level vs Gender
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure and axes
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Plot Accident Level vs Gender
sns.countplot(x='Accident Level', hue='Gender', data=ISH_df_cleaned, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Gender')
# Plot Potential Accident Level vs Gender
sns.countplot(x='Potential Accident Level', hue='Gender', data=ISH_df_cleaned, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Gender')
# Rotate x-axis labels for better readability
plt.setp(axes[0].get_xticklabels(), rotation=0)
plt.setp(axes[1].get_xticklabels(), rotation=0)
# Adjust layout and display the plot
plt.tight_layout()
plt.show()
Accident Level vs Gender:
Potential Accident Level vs Gender:
# @title Accident Level and Potential Accident Level vs Employee Type
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure and axes
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Plot Accident Level vs Employee Type
sns.countplot(x='Accident Level', hue='Employee Type', data=ISH_df_cleaned, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Employee Type')
# Plot Potential Accident Level vs Employee Type
sns.countplot(x='Potential Accident Level', hue='Employee Type', data=ISH_df_cleaned, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Employee Type')
# Rotate x-axis labels for better readability
plt.setp(axes[0].get_xticklabels(), rotation=0)
plt.setp(axes[1].get_xticklabels(), rotation=0)
# Adjust layout and display the plot
plt.tight_layout()
plt.show()
Accident Level vs Employee Type:
Potential Accident Level vs Employee Type:
# @title Accident Level and Potential Accident Over Years and Months
# Extract year and month from the 'Date' column
ISH_df_cleaned['Year'] = ISH_df_cleaned['Date'].dt.year
ISH_df_cleaned['Month'] = ISH_df_cleaned['Date'].dt.month
# Plot Accident Level and Potential Accident Level against Year
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.countplot(x='Year', hue='Accident Level', data=ISH_df_cleaned, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Year')
sns.countplot(x='Year', hue='Potential Accident Level', data=ISH_df_cleaned, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Year')
plt.tight_layout()
plt.show()
# Plot Accident Level and Potential Accident Level against Month
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
sns.countplot(x='Month', hue='Accident Level', data=ISH_df_cleaned, ax=axes[0], palette='Set2')
axes[0].set_title('Accident Level vs Month')
sns.countplot(x='Month', hue='Potential Accident Level', data=ISH_df_cleaned, ax=axes[1], palette='Set2')
axes[1].set_title('Potential Accident Level vs Month')
plt.tight_layout()
plt.show()
Accident Level vs Year:
Potential Accident Level vs Year:
Accident Level vs Month:
Potential Accident Level vs Month:
# @title Monthly Frequency of Accidents Over Years
# Group by year and month and count accidents
monthly_accidents = ISH_df_cleaned.groupby(['Year', 'Month'])['Date'].count().reset_index(name='Accident Count')
# Pivot the table for plotting
monthly_accidents_pivot = monthly_accidents.pivot(index='Month', columns='Year', values='Accident Count')
# Plot the monthly accident frequency for each year
# Pass figsize to .plot(); a separate plt.figure() call would be left behind as an empty extra figure
monthly_accidents_pivot.plot(kind='line', marker='o', figsize=(10, 6))
plt.title('Monthly Frequency of Accidents Over Years', fontsize=12)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.xticks(range(1, 13)) # Set x-axis ticks to represent months
plt.legend(title='Year', loc='upper right')
plt.grid(True, linestyle='--', alpha=0.7)  # style arguments only take effect when the grid is on
plt.tight_layout()
plt.show()
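The groupby-then-pivot step above can be verified on a toy frame with the same column names (the miniature data below is illustrative only):

```python
import pandas as pd

# Tiny stand-in for ISH_df_cleaned: three accidents across two years
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-05', '2016-01-20', '2017-01-11']),
})
toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month

# Same shape as the cell above: one row per (Year, Month) with an accident count
monthly = toy.groupby(['Year', 'Month'])['Date'].count().reset_index(name='Accident Count')

# Months become rows, years become columns, ready for one line per year
pivot = monthly.pivot(index='Month', columns='Year', values='Accident Count')
```

Here `pivot.loc[1, 2016]` is 2 and `pivot.loc[1, 2017]` is 1, confirming the month-by-year layout the line plot relies on.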
Overall Trend:
Seasonal Variations:
Year-to-Year Fluctuations:
Further Analysis:
# @title Date vs Potential Accident Level count()
from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
palette = list(sns.palettes.mpl_palette('Dark2'))
counted = (series['Date']
.value_counts()
.reset_index(name='counts')
.rename({'index': 'Date'}, axis=1)
.sort_values('Date', ascending=True))
xs = counted['Date']
ys = counted['counts']
plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = ISH_df_cleaned.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Potential Accident Level')):
_plot_series(series, series_name, i)
fig.legend(title='Potential Accident Level', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')
Trend Over Time:
Potential Accident Level IV:
Fluctuations and Peaks:
No Clear Pattern:
# @title Date vs Accident Level count()
from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
palette = list(sns.palettes.mpl_palette('Dark2'))
counted = (series['Date']
.value_counts()
.reset_index(name='counts')
.rename({'index': 'Date'}, axis=1)
.sort_values('Date', ascending=True))
xs = counted['Date']
ys = counted['counts']
plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = ISH_df_cleaned.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Accident Level')):
_plot_series(series, series_name, i)
fig.legend(title='Accident Level', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')
Trend Over Time:
Accident Levels I and II:
Fluctuations and Peaks:
No Clear Pattern:
# @title Date vs Industry Sector count()
from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
palette = list(sns.palettes.mpl_palette('Dark2'))
counted = (series['Date']
.value_counts()
.reset_index(name='counts')
.rename({'index': 'Date'}, axis=1)
.sort_values('Date', ascending=True))
xs = counted['Date']
ys = counted['counts']
plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = ISH_df_cleaned.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Industry Sector')):
_plot_series(series, series_name, i)
fig.legend(title='Industry Sector', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')
Mining Sector:
Fluctuations and Peaks:
Other Sectors:
No Clear Trend:
Importance of Sector-Specific Analysis:
# @title Date vs Country count()
from matplotlib import pyplot as plt
import seaborn as sns
def _plot_series(series, series_name, series_index=0):
palette = list(sns.palettes.mpl_palette('Dark2'))
counted = (series['Date']
.value_counts()
.reset_index(name='counts')
.rename({'index': 'Date'}, axis=1)
.sort_values('Date', ascending=True))
xs = counted['Date']
ys = counted['counts']
plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
fig, ax = plt.subplots(figsize=(15, 5), layout='constrained')
df_sorted = ISH_df_cleaned.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Country')):
_plot_series(series, series_name, i)
fig.legend(title='Country', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('count()')
Country_01:
Fluctuations and Peaks:
Country_02 and Country_03:
No Clear Trend:
Country-Specific Factors:
# Remove 'Year' and 'Month' columns from the dataframe
ISH_df_cleaned = ISH_df_cleaned.drop(['Year', 'Month'], axis=1)
# @title Accident Level vs Potential Accident Level
# Create a cross-tabulation of Accident Level and Potential Accident Level
df_2dhist = pd.DataFrame({
x_label: grp['Potential Accident Level'].value_counts()
for x_label, grp in ISH_df_cleaned.groupby('Accident Level')
})
# Plot a heatmap
plt.figure(figsize=(9, 8))
sns.heatmap(df_2dhist, annot=True, cmap='Set3')
plt.title('Relationship between Accident Level and Potential Accident Level')
plt.xlabel('Potential Accident Level')
plt.ylabel('Accident Level')
plt.show()
Diagonal Dominance:
Potential for Worse Outcomes:
Preventive Measures:
Focus Areas for Improvement:
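The dict comprehension used to build `df_2dhist` above is equivalent to a single `pd.crosstab` call, which is shorter and easier to audit; a sketch on toy data (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({
    'Accident Level': ['I', 'I', 'II'],
    'Potential Accident Level': ['II', 'III', 'III'],
})

# Rows = Potential Accident Level, columns = Accident Level,
# matching the orientation of the heatmap above
xtab = pd.crosstab(toy['Potential Accident Level'], toy['Accident Level'])
```

`xtab` can be passed straight to `sns.heatmap(xtab, annot=True, cmap='Set3')` in place of `df_2dhist`.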
plt.figure(figsize=(10, 6))
sns.countplot(x='Accident Level', hue='Potential Accident Level', data=ISH_df_cleaned, palette='Set2')
plt.title('Accident Level vs Potential Accident Level')
plt.show()
# @title Industry Sector vs Accident Level
# Group the data by Industry Sector and Accident Level, counting occurrences
grouped_data = ISH_df_cleaned.groupby(['Industry Sector', 'Accident Level'])['Accident Level'].count().unstack().fillna(0)
# Plot a stacked bar chart
grouped_data.plot(kind='bar', stacked=True, figsize=(8, 6),cmap='Set3')
plt.title('Industry Sector vs Accident Level')
plt.xlabel('Industry Sector')
plt.ylabel('Number of Accidents')
plt.xticks(rotation=0)  # ha='right' is only useful with rotated labels
plt.legend(title='Accident Level')
plt.tight_layout()
plt.show()
Mining Sector:
Other Sectors:
Severity Distribution:
Focus Areas for Improvement:
# @title Distribution of Accident Levels Across Countries
import matplotlib.pyplot as plt
# Assuming 'ISH_df_cleaned' is your DataFrame
city_accident_counts = ISH_df_cleaned.groupby(['Country', 'Accident Level'])['Accident Level'].count().unstack()
city_accident_counts.plot(kind='bar', figsize=(10, 6), cmap='Set3')
plt.xlabel('Country')
plt.ylabel('Number of Accidents')
plt.title('Distribution of Accident Levels Across Countries')
plt.xticks(rotation=90)
_ = plt.tight_layout()
Country_01
Country_02
Country_03
Across all countries, the number of accidents decreases as the accident level increases. This is expected, as more severe accidents are generally less frequent.
The distribution of accident levels varies across countries, highlighting potential differences in safety regulations, industry practices, or risk factors specific to each country.
# @title Distribution of Accident Levels Across Cities
import matplotlib.pyplot as plt
# Assuming 'ISH_df_cleaned' is your DataFrame
city_accident_counts = ISH_df_cleaned.groupby(['City', 'Accident Level'])['Accident Level'].count().unstack()
city_accident_counts.plot(kind='bar', figsize=(15, 6), cmap='Set3')
plt.xlabel('City')
plt.ylabel('Number of Accidents')
plt.title('Distribution of Accident Levels Across Cities')
plt.xticks(rotation=90)
_ = plt.tight_layout()
Accident Distribution:
Severity Variation:
City-Specific Patterns:
Potential Focus Areas:
# @title Country vs Industry Sector
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
plt.subplots(figsize=(7, 6))
df_2dhist = pd.DataFrame({
x_label: grp['Industry Sector'].value_counts()
for x_label, grp in ISH_df_cleaned.groupby('Country')
})
sns.heatmap(df_2dhist, cmap='Set3')
plt.xlabel('Country', fontsize=10)
_ = plt.ylabel('Industry Sector')
Country_01:
Country_02:
Country_03:
Overall:
# @title Critical Risk vs Industry Sector
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Industry Sector', data=ISH_df_cleaned, palette='Set2')
plt.title('Industry Sector vs Critical Risk')
plt.show()
Environmental Risk:
Health and Safety Risk:
Process Safety Risk:
Other Risks:
Sector-Specific Risks:
Focus Areas for Improvement:
# @title Critical Risk vs Accident Level
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Accident Level', data=ISH_df_cleaned, palette='Set2')
plt.title('Critical Risk vs Accident Level')
plt.show()
Environmental Risk:
Health and Safety Risk:
Process Safety Risk:
Other Risks:
Focus Areas for Improvement:
# @title Critical Risk vs Potential Accident Level
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Potential Accident Level', data=ISH_df_cleaned, palette='Set2')
plt.title('Critical Risk vs Potential Accident Level')
plt.show()
Environmental Risk:
Health and Safety Risk:
Process Safety Risk:
Other Risks:
Potential Accident Level and Risk Correlation:
Focus Areas for Improvement:
# @title Critical Risk vs Employee Type
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Employee Type', data=ISH_df_cleaned, palette='Set2')
plt.title('Employee Type vs Critical Risk')
plt.show()
Environmental Risk:
Health and Safety Risk:
Process Safety Risk:
Other Risks:
Employee Type and Risk Correlation:
Focus Areas for Improvement:
!pip install holidays
Requirement already satisfied: holidays in /usr/local/lib/python3.10/dist-packages (0.55)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (from holidays) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil->holidays) (1.16.0)
# Build the list of Brazilian holidays for 2016 and 2017
import holidays
# Get Brazilian holidays for 2016 and 2017
brazil_holidays_2016 = holidays.Brazil(years=2016)
brazil_holidays_2017 = holidays.Brazil(years=2017)
# Extract the holidays
holidays_2016 = list(brazil_holidays_2016.items())
holidays_2017 = list(brazil_holidays_2017.items())
# Create dataframes for the holidays
holidays_2016_df = pd.DataFrame(holidays_2016, columns=['Date', 'Holiday'])
holidays_2017_df = pd.DataFrame(holidays_2017, columns=['Date', 'Holiday'])
# Concatenate the two dataframes
all_holidays_df = pd.concat([holidays_2016_df, holidays_2017_df])
# Display the combined dataframe
all_holidays_df
| Date | Holiday | |
|---|---|---|
| 0 | 2016-01-01 | Confraternização Universal |
| 1 | 2016-03-25 | Sexta-feira Santa |
| 2 | 2016-04-21 | Tiradentes |
| 3 | 2016-05-01 | Dia do Trabalhador |
| 4 | 2016-09-07 | Independência do Brasil |
| 5 | 2016-10-12 | Nossa Senhora Aparecida |
| 6 | 2016-11-02 | Finados |
| 7 | 2016-11-15 | Proclamação da República |
| 8 | 2016-12-25 | Natal |
| 0 | 2017-01-01 | Confraternização Universal |
| 1 | 2017-04-14 | Sexta-feira Santa |
| 2 | 2017-04-21 | Tiradentes |
| 3 | 2017-05-01 | Dia do Trabalhador |
| 4 | 2017-09-07 | Independência do Brasil |
| 5 | 2017-10-12 | Nossa Senhora Aparecida |
| 6 | 2017-11-02 | Finados |
| 7 | 2017-11-15 | Proclamação da República |
| 8 | 2017-12-25 | Natal |
import holidays
from datetime import datetime
# Assuming 'Date' column is in the format 'YYYY-MM-DD'
def add_date_features(df):
    """
    Adds Weekend, Holiday, Season, DayOfWeek, Year, Month, and Day columns to the dataframe.
    Args:
        df: The dataframe to add features to.
    Returns:
        A copy of the dataframe with the added features.
    """
    # Work on a copy of the argument (not the global) so the original dataframe is untouched
    ISH_df_preprocess = df.copy()
# Convert 'Date' to datetime objects
ISH_df_preprocess['Date'] = pd.to_datetime(ISH_df_preprocess['Date'])
# Create Brazilian holidays calendar
br_holidays = holidays.Brazil()
# Add Weekend feature
ISH_df_preprocess['Weekend'] = ISH_df_preprocess['Date'].dt.dayofweek.isin([5, 6]).astype(int)
# Add Holiday feature
ISH_df_preprocess['Holiday'] = ISH_df_preprocess['Date'].apply(lambda date: 1 if date in br_holidays else 0)
# Add Season feature
ISH_df_preprocess['Season'] = ISH_df_preprocess['Date'].dt.month.apply(lambda month:
'Summer' if month in [12, 1, 2] else
'Autumn' if month in [3, 4, 5] else
'Winter' if month in [6, 7, 8] else
'Spring')
# Add DayOfWeek feature
ISH_df_preprocess['DayOfWeek'] = ISH_df_preprocess['Date'].dt.dayofweek
# Split Date into Year, Month, and Day
ISH_df_preprocess['Year'] = ISH_df_preprocess['Date'].dt.year
ISH_df_preprocess['Month'] = ISH_df_preprocess['Date'].dt.month
ISH_df_preprocess['Day'] = ISH_df_preprocess['Date'].dt.day
# Remove Date column
ISH_df_preprocess = ISH_df_preprocess.drop('Date', axis=1)
return ISH_df_preprocess # Return the modified dataframe
# Apply the function to dataframe and store the result
ISH_df_preprocess = add_date_features(ISH_df_cleaned)
ISH_df_preprocess
| Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | Weekend | Holiday | Season | DayOfWeek | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. | 0 | 1 | Summer | 4 | 2016 | 1 | 1 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. | 1 | 0 | Summer | 5 | 2016 | 1 | 2 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. | 0 | 0 | Summer | 2 | 2016 | 1 | 6 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. | 0 | 0 | Summer | 4 | 2016 | 1 | 8 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. | 1 | 0 | Summer | 6 | 2016 | 1 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | Being approximately 5:00 a.m. approximately, when lifting the Kelly HQ towards the pulley of the frame to align it, the assistant Marco that is in the later one is struck the hand against the frame generating the injury. | 0 | 0 | Winter | 1 | 2017 | 7 | 4 |
| 421 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | The collaborator moved from the infrastructure office (Julio to the toilets, when the pin of the right shoe is hooked on the bra of the left shoe causing not to take the step and fall untimely, causing injury described. | 0 | 0 | Winter | 1 | 2017 | 7 | 4 |
| 422 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | During the environmental monitoring activity in the area, the employee was surprised by a swarming swarm of weevils. During the exit of the place, endured suffering two stings, being one in the face and the other in the middle finger of the left hand. | 0 | 0 | Winter | 2 | 2017 | 7 | 5 |
| 423 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | The Employee performed the activity of stripping cathodes, when pulling the cathode sheet his hand hit the side of another cathode, causing a blunt cut on his 2nd finger of the left hand. | 0 | 0 | Winter | 3 | 2017 | 7 | 6 |
| 424 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | At 10:00 a.m., when the assistant cleaned the floor of module "E" in the central camp, she slipped back and immediately grabbed the laundry table to avoid falling to the floor; suffering the described injury. | 1 | 0 | Winter | 6 | 2017 | 7 | 9 |
418 rows × 16 columns
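Because the plants are in the Southern Hemisphere, `add_date_features` maps December–February to Summer and June–August to Winter. That mapping can be checked in isolation; the helper name `season_of` below is our own, written only to mirror the lambda inside the function:

```python
def season_of(month: int) -> str:
    """Southern Hemisphere season for a 1-12 month number,
    mirroring the Season mapping in add_date_features."""
    if month in (12, 1, 2):
        return 'Summer'
    if month in (3, 4, 5):
        return 'Autumn'
    if month in (6, 7, 8):
        return 'Winter'
    return 'Spring'  # months 9, 10, 11
```

This agrees with the table above, where the January 2016 rows carry `Season == 'Summer'` and the July 2017 rows carry `Season == 'Winter'`.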
# @title Accident Level and Potential Accident Level vs Holidays and Non Holidays.
# Group the data and count accidents for each combination
holiday_accidents = ISH_df_preprocess.groupby(['Holiday', 'Accident Level'])['Accident Level'].count().unstack().fillna(0)
holiday_potential_accidents = ISH_df_preprocess.groupby(['Holiday', 'Potential Accident Level'])['Potential Accident Level'].count().unstack().fillna(0)
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Plot Holiday vs Accident Level
holiday_accidents.plot(kind='bar', stacked=True, ax=axes[0], cmap='Set3')
axes[0].set_title('Holiday vs Accident Level')
axes[0].set_xlabel('Holiday')
axes[0].set_ylabel('Number of Accidents')
axes[0].set_xticklabels(['Non-Holiday', 'Holiday'], rotation=0)
# Plot Holiday vs Potential Accident Level
holiday_potential_accidents.plot(kind='bar', stacked=True, ax=axes[1], cmap='Set3')
axes[1].set_title('Holiday vs Potential Accident Level')
axes[1].set_xlabel('Holiday')
axes[1].set_ylabel('Number of Accidents')
axes[1].set_xticklabels(['Non-Holiday', 'Holiday'], rotation=0)
plt.tight_layout()
plt.show()
Holiday vs Accident Level:
Holiday vs Potential Accident Level:
Overall:
# @title Critical Risks vs Holidays and Non Holidays.
# Group the data and count accidents for each combination
holiday_critical_risks = ISH_df_preprocess.groupby(['Holiday', 'Critical Risk'])['Critical Risk'].count().unstack().fillna(0)
# Plot Holiday vs Critical Risk using a grouped bar chart
holiday_critical_risks.plot(kind='bar', figsize=(15, 10), cmap='Set3')
plt.title('Holiday vs Critical Risk')
plt.xlabel('Holiday')
plt.ylabel('Number of Occurrences')
plt.xticks([0, 1], ['Non-Holiday', 'Holiday'], rotation=0)
plt.tight_layout()
plt.show()
Environmental Risk:
Health and Safety Risk:
Process Safety Risk:
Other Risks:
Overall:
# @title Season vs Accident Levels, Potential Accident Levels
# Season vs Accident Level
plt.figure(figsize=(10, 6))
sns.countplot(x='Season', hue='Accident Level', data=ISH_df_preprocess, palette='Set2')
plt.title('Season vs Accident Level')
plt.show()
# Season vs Potential Accident Level
plt.figure(figsize=(10, 6))
sns.countplot(x='Season', hue='Potential Accident Level', data=ISH_df_preprocess, palette='Set2')
plt.title('Season vs Potential Accident Level')
plt.show()
Season vs Accident Level:
Season vs Potential Accident Level:
# @title Season vs Critical Risk
plt.figure(figsize=(12, 18))
sns.countplot(y='Critical Risk', hue='Season', data=ISH_df_preprocess, palette='Set2')
plt.title('Critical Risk vs Season')
plt.show()
Critical Risk vs Season:
# @title Potential Accident Level vs Weekend
from matplotlib import pyplot as plt
import seaborn as sns
figsize = (12, 1.2 * len(ISH_df_preprocess['Potential Accident Level'].unique()))
plt.figure(figsize=figsize)
sns.violinplot(ISH_df_preprocess, x='Weekend', y='Potential Accident Level', inner='stick', palette='Set2')
sns.despine(top=True, right=True, bottom=True, left=True)
Weekends vs Weekdays:
Potential Accident Level I:
Higher Potential Accident Levels:
Further Analysis:
# @title Accident Level vs Weekend
from matplotlib import pyplot as plt
import seaborn as sns
figsize = (12, 1.2 * len(ISH_df_preprocess['Accident Level'].unique()))
plt.figure(figsize=figsize)
sns.violinplot(ISH_df_preprocess, x='Weekend', y='Accident Level', inner='stick', palette='Set2')
sns.despine(top=True, right=True, bottom=True, left=True)
Weekend vs Weekday Accidents:
Consistent Severity:
Potential Factors:
Based on the provided visualizations and analysis, the following attributes appear to have minimal impact and could potentially be dropped:
Weekend: The analysis suggests that the occurrence and severity of accidents are not significantly influenced by whether it's a weekend or a weekday.
Season: While there are some minor variations in critical risks across seasons, the overall distribution of accidents and their potential severity appear relatively consistent across seasons.
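Before dropping a categorical feature on visual grounds, a chi-square test of independence can back the judgement with a number. The sketch below uses `scipy.stats.chi2_contingency`; the 5% significance threshold and the helper name `is_independent` are our assumptions, not part of the original analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def is_independent(df, col_a, col_b, alpha=0.05):
    """Return True when a chi-square test fails to reject
    independence of two categorical columns (p-value > alpha)."""
    table = pd.crosstab(df[col_a], df[col_b])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha
```

For example, `is_independent(ISH_df_preprocess, 'Weekend', 'Accident Level')` returning True would support dropping `Weekend` as uninformative for severity.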
# Drop the Season, Weekend, and Holiday columns, which showed minimal impact
ISH_df_preprocess = ISH_df_preprocess.drop(['Season', 'Weekend', 'Holiday'], axis=1)
ISH_df_preprocess
| Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | DayOfWeek | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. | 4 | 2016 | 1 | 1 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. | 5 | 2016 | 1 | 2 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. | 2 | 2016 | 1 | 6 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. | 4 | 2016 | 1 | 8 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. | 6 | 2016 | 1 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | Being approximately 5:00 a.m. approximately, when lifting the Kelly HQ towards the pulley of the frame to align it, the assistant Marco that is in the later one is struck the hand against the frame generating the injury. | 1 | 2017 | 7 | 4 |
| 421 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | The collaborator moved from the infrastructure office (Julio to the toilets, when the pin of the right shoe is hooked on the bra of the left shoe causing not to take the step and fall untimely, causing injury described. | 1 | 2017 | 7 | 4 |
| 422 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | During the environmental monitoring activity in the area, the employee was surprised by a swarming swarm of weevils. During the exit of the place, endured suffering two stings, being one in the face and the other in the middle finger of the left hand. | 2 | 2017 | 7 | 5 |
| 423 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | The Employee performed the activity of stripping cathodes, when pulling the cathode sheet his hand hit the side of another cathode, causing a blunt cut on his 2nd finger of the left hand. | 3 | 2017 | 7 | 6 |
| 424 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | At 10:00 a.m., when the assistant cleaned the floor of module "E" in the central camp, she slipped back and immediately grabbed the laundry table to avoid falling to the floor; suffering the described injury. | 6 | 2017 | 7 | 9 |
418 rows × 13 columns
Pre-NLP check for frequently occurring words and phrases
ISH_df_preprocess.to_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/ISH_df_preprocess.csv', index=False)
from collections import Counter
import re
import nltk
from nltk.corpus import stopwords
# Ensure stopwords are downloaded
nltk.download('stopwords')
# Function to clean and tokenize descriptions
def tokenize(text):
# Use a regular expression to find words that are purely alphabetic
tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
# Filter out stopwords
stop_words = set(stopwords.words('english'))
return [word for word in tokens if word not in stop_words]
# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Tokenize each description and create a flat list of all words
all_words = [word for description in ISH_df_preprocess['Description'] for word in tokenize(description)]
# Count the frequency of each word
word_counts = Counter(all_words)
# Display the most common words to get insights for categorizing accidents
word_counts.most_common(50)
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
[('causing', 166),
('hand', 163),
('employee', 156),
('left', 155),
('right', 154),
('operator', 126),
('injury', 104),
('time', 101),
('activity', 91),
('area', 80),
('moment', 78),
('equipment', 77),
('work', 76),
('accident', 73),
('collaborator', 71),
('level', 70),
('worker', 70),
('assistant', 68),
('finger', 68),
('pipe', 67),
('one', 65),
('floor', 65),
('support', 58),
('mesh', 58),
('rock', 54),
('safety', 53),
('mr', 53),
('approximately', 50),
('meters', 47),
('height', 46),
('described', 45),
('part', 44),
('team', 44),
('side', 43),
('injured', 42),
('truck', 42),
('face', 42),
('used', 42),
('kg', 40),
('circumstances', 39),
('cut', 39),
('gloves', 39),
('pump', 38),
('hit', 38),
('metal', 38),
('performing', 37),
('medical', 37),
('towards', 37),
('using', 35),
('made', 34)]
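The frequent unigrams above ('hand', 'finger', 'rock', 'cut', 'gloves', ...) suggest recurring injury themes. As a minimal sketch, a hypothetical keyword-to-theme lookup (the theme names below are illustrative, not labels from the dataset) could flag candidate themes in a tokenized description:

```python
# Hypothetical keyword -> theme lookup, motivated by the frequent unigrams
# above; the theme names are illustrative, not dataset categories.
KEYWORD_THEMES = {
    "hand": "Hand/Finger Injury",
    "finger": "Hand/Finger Injury",
    "rock": "Falling Object",
    "cut": "Cut/Laceration",
    "fall": "Fall",
}

def flag_themes(tokens):
    """Return the set of themes triggered by a list of cleaned tokens."""
    return {KEYWORD_THEMES[t] for t in tokens if t in KEYWORD_THEMES}

print(sorted(flag_themes(["rock", "hit", "finger", "left", "hand"])))
# -> ['Falling Object', 'Hand/Finger Injury']
```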
# Function to tokenize descriptions, filtering out numbers and special characters
def tokenize(text):
# Regular expression to find words that are purely alphabetic
tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
# Filter out stopwords
stop_words = set(stopwords.words('english'))
return [word for word in tokens if word not in stop_words]
# Function to find phrases that might indicate new categories
def find_phrases(text, length=2):
tokens = tokenize(text)
return [' '.join(tokens[i:i+length]) for i in range(len(tokens) - length + 1)]
# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Generate bi-grams (two-word phrases) from descriptions
bi_grams = [phrase for description in ISH_df_preprocess['Description'] for phrase in find_phrases(description, 2)]
# Count the frequency of each bi-gram
bi_gram_counts = Counter(bi_grams)
# Display the most common bi-grams to get insights for new categorizing accidents
bi_gram_counts.most_common(50)
[('left hand', 70),
('right hand', 57),
('time accident', 56),
('causing injury', 51),
('finger left', 22),
('employee reports', 22),
('injury described', 18),
('medical center', 17),
('described injury', 17),
('left foot', 15),
('injured person', 15),
('hand causing', 14),
('support mesh', 14),
('injury time', 14),
('right side', 13),
('finger right', 13),
('da silva', 13),
('allergic reaction', 13),
('right leg', 11),
('safety gloves', 11),
('made use', 10),
('fragment rock', 10),
('wearing safety', 10),
('time event', 10),
('right foot', 9),
('split set', 9),
('upper part', 9),
('left leg', 9),
('middle finger', 9),
('height meters', 9),
('ring finger', 9),
('left side', 9),
('accident employee', 9),
('weight kg', 8),
('generating injury', 8),
('causing cut', 8),
('generating described', 8),
('metal structure', 8),
('work area', 8),
('kg weight', 7),
('transferred medical', 7),
('master loader', 7),
('worker wearing', 7),
('index finger', 7),
('piece rock', 7),
('employee performing', 7),
('x cm', 7),
('lesion described', 7),
('used safety', 7),
('described time', 7)]
# Function to tokenize descriptions, filtering out numbers and special characters
def tokenize(text):
# Regular expression to find words that are purely alphabetic
tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
# Filter out stopwords
stop_words = set(stopwords.words('english'))
return [word for word in tokens if word not in stop_words]
# Function to find phrases that might indicate new categories
def find_phrases(text, length=3): # Adjust length default to 3 for trigrams
tokens = tokenize(text)
return [' '.join(tokens[i:i+length]) for i in range(len(tokens) - length + 1)]
# Assuming ISH_df_preprocess['Description'] contains the descriptions
# Generate trigrams (three-word phrases) from descriptions
tri_grams = [phrase for description in ISH_df_preprocess['Description'] for phrase in find_phrases(description)]
# Count the frequency of each trigram
tri_gram_counts = Counter(tri_grams)
# Display the most common trigrams to get insights for new categorizing accidents
tri_gram_counts.most_common(50)
[('finger left hand', 21),
('causing injury described', 13),
('finger right hand', 13),
('injury time accident', 13),
('generating described injury', 8),
('time accident employee', 8),
('hand causing injury', 7),
('described time accident', 7),
('left hand causing', 6),
('right hand causing', 6),
('back right hand', 5),
('worker wearing safety', 5),
('causing described injury', 5),
('cm x cm', 5),
('causing injury time', 5),
('returned normal activities', 5),
('manoel da silva', 5),
('approximately nv cx', 4),
('time accident worker', 4),
('accident worker wearing', 4),
('wearing safety gloves', 4),
('medical center attention', 4),
('made use safety', 4),
('used safety glasses', 4),
('generating injury time', 4),
('described injury time', 4),
('thermal recovery boiler', 4),
('verified type allergic', 4),
('type allergic reaction', 4),
('allergic reaction returned', 4),
('reaction returned normal', 4),
('generating lesion described', 4),
('place clerk wearing', 4),
('hand generating described', 4),
('employee reports performed', 4),
('hitting palm left', 3),
('palm left hand', 3),
('time fragment rock', 3),
('floor causing injury', 3),
('worker time accident', 3),
('transferred medical center', 3),
('little finger left', 3),
('index finger right', 3),
('type safety gloves', 3),
('circumstances two workers', 3),
('crown piece rock', 3),
('time event collaborator', 3),
('causing blunt cut', 3),
('use safety belt', 3),
('heavy equipment operator', 3)]
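The `find_phrases` helper above re-tokenizes on every call and is redefined for each n-gram length. An equivalent, self-contained sliding-window sketch that works for any `n` on an already-tokenized list:

```python
def ngrams(tokens, n):
    # Zip the token list against its shifted copies to slide an n-wide window.
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = "left hand causing injury".split()
print(ngrams(tokens, 2))  # -> ['left hand', 'hand causing', 'causing injury']
print(ngrams(tokens, 3))  # -> ['left hand causing', 'hand causing injury']
```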
from wordcloud import WordCloud
# Create wordcloud for unigrams
wordcloud_unigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)
# Create wordcloud for bigrams
wordcloud_bigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(bi_gram_counts)
# Create wordcloud for trigrams
wordcloud_trigrams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(tri_gram_counts)
# Display the generated wordclouds
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_unigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Unigram Wordcloud")
plt.show()
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_bigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Bigram Wordcloud")
plt.show()
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_trigrams, interpolation='bilinear')
plt.axis("off")
plt.title("Trigram Wordcloud")
plt.show()
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import string
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip. [nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Package stopwords is already up-to-date! [nltk_data] Downloading package wordnet to /root/nltk_data...
True
# Load the dataset
ISH_NLP_preprocess = pd.read_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/ISH_df_preprocess.csv')
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
# Convert to lowercase
text = text.lower()
# Remove special characters and numbers
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Tokenize the text
tokens = word_tokenize(text)
# Remove stopwords and lemmatize
cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
# Join the tokens back into a string
cleaned_text = ' '.join(cleaned_tokens)
return cleaned_text
# Apply preprocessing to the Description column
ISH_NLP_preprocess['Cleaned_Description'] = ISH_NLP_preprocess['Description'].apply(preprocess_text)
# Display the first few rows of the original and cleaned descriptions
ISH_NLP_preprocess[['Description', 'Cleaned_Description']].head()
# Save the number of words before and after cleaning
ISH_NLP_preprocess['Original_Word_Count'] = ISH_NLP_preprocess['Description'].apply(lambda x: len(str(x).split()))
ISH_NLP_preprocess['Cleaned_Word_Count'] = ISH_NLP_preprocess['Cleaned_Description'].apply(lambda x: len(str(x).split()))
ISH_NLP_preprocess[['Description', 'Cleaned_Description']].head()
| Description | Cleaned_Description | |
|---|---|---|
| 0 | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo |
| 1 | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter |
| 2 | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury |
| 3 | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury |
| 4 | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described |
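For reference, the same cleaning steps (minus lemmatization) can be sketched without NLTK. The tiny stopword list below is illustrative only; the notebook uses NLTK's full English stopword list:

```python
import re

STOP = {"the", "of", "a", "was", "and", "to", "in"}  # illustrative subset

def preprocess_light(text):
    # Lowercase, keep letters and spaces only, then drop stopwords.
    text = re.sub(r"[^a-zA-Z\s]", "", text.lower())
    return " ".join(t for t in text.split() if t not in STOP)

print(preprocess_light("The piping was uncoupled, causing the injury."))
# -> 'piping uncoupled causing injury'
```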
ISH_NLP_preprocess
| Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | Description | DayOfWeek | Year | Month | Day | Cleaned_Description | Original_Word_Count | Cleaned_Word_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo. | 4 | 2016 | 1 | 1 | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo | 80 | 37 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter. | 5 | 2016 | 1 | 2 | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter | 54 | 27 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel tip of the safety shoe and then the metatarsal area of the left foot of the collaborator causing the injury. | 2 | 2016 | 1 | 6 | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury | 57 | 28 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 CX-695 OB7, the personnel begins the task of unlocking the Soquet bolts of the BHB machine, when they were in the penultimate bolt they identified that the hexagonal head was worn, proceeding Mr. Cristóbal - Auxiliary assistant to climb to the platform to exert pressure with your hand on the "DADO" key, to prevent it from coming out of the bolt; in those moments two collaborators rotate with the lever in anti-clockwise direction, leaving the key of the bolt, hitting the palm of the left hand, causing the injury. | 4 | 2016 | 1 | 8 | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury | 97 | 49 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances that the mechanics Anthony (group leader), Eduardo and Eric Fernández-injured-the three of the Company IMPROMEC, performed the removal of the pulley of the motor of the pump 3015 in the ZAF of Marcy. 27 cm / Length: 33 cm / Weight: 70 kg), as it was locked proceed to heating the pulley to loosen it, it comes out and falls from a distance of 1.06 meters high and hits the instep of the right foot of the worker, causing the injury described. | 6 | 2016 | 1 | 10 | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described | 88 | 42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | Being approximately 5:00 a.m. approximately, when lifting the Kelly HQ towards the pulley of the frame to align it, the assistant Marco that is in the later one is struck the hand against the frame generating the injury. | 1 | 2017 | 7 | 4 | approximately approximately lifting kelly hq towards pulley frame align assistant marco later one struck hand frame generating injury | 38 | 18 |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | The collaborator moved from the infrastructure office (Julio to the toilets, when the pin of the right shoe is hooked on the bra of the left shoe causing not to take the step and fall untimely, causing injury described. | 1 | 2017 | 7 | 4 | collaborator moved infrastructure office julio toilet pin right shoe hooked bra left shoe causing take step fall untimely causing injury described | 39 | 21 |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | During the environmental monitoring activity in the area, the employee was surprised by a swarming swarm of weevils. During the exit of the place, endured suffering two stings, being one in the face and the other in the middle finger of the left hand. | 2 | 2017 | 7 | 5 | environmental monitoring activity area employee surprised swarming swarm weevil exit place endured suffering two sting one face middle finger left hand | 44 | 21 |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | The Employee performed the activity of stripping cathodes, when pulling the cathode sheet his hand hit the side of another cathode, causing a blunt cut on his 2nd finger of the left hand. | 3 | 2017 | 7 | 6 | employee performed activity stripping cathode pulling cathode sheet hand hit side another cathode causing blunt cut nd finger left hand | 33 | 20 |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | At 10:00 a.m., when the assistant cleaned the floor of module "E" in the central camp, she slipped back and immediately grabbed the laundry table to avoid falling to the floor; suffering the described injury. | 6 | 2017 | 7 | 9 | assistant cleaned floor module e central camp slipped back immediately grabbed laundry table avoid falling floor suffering described injury | 35 | 19 |
418 rows × 16 columns
# Calculate and print the average word count before and after cleaning
avg_original = ISH_NLP_preprocess['Original_Word_Count'].mean()
avg_cleaned = ISH_NLP_preprocess['Cleaned_Word_Count'].mean()
print(f"\nAverage word count before cleaning: {avg_original:.2f}")
print(f"Average word count after cleaning: {avg_cleaned:.2f}")
print(f"Reduction in words: {(avg_original - avg_cleaned) / avg_original * 100:.2f}%")
Average word count before cleaning: 65.06
Average word count after cleaning: 32.80
Reduction in words: 49.58%
# Remove repetitive columns that are no longer required for analysis
Unnecessary_Columns = ['Description','Original_Word_Count','Cleaned_Word_Count']
# Drop unnecessary columns
ISH_NLP_preprocess = ISH_NLP_preprocess.drop(Unnecessary_Columns, axis=1)
ISH_NLP_preprocess
| Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | Month | Day | Cleaned_Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | 1 | 1 | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | 1 | 2 | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | 1 | 6 | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | 1 | 8 | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | 1 | 10 | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | 7 | 4 | approximately approximately lifting kelly hq towards pulley frame align assistant marco later one struck hand frame generating injury |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | 7 | 4 | collaborator moved infrastructure office julio toilet pin right shoe hooked bra left shoe causing take step fall untimely causing injury described |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | 7 | 5 | environmental monitoring activity area employee surprised swarming swarm weevil exit place endured suffering two sting one face middle finger left hand |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | 7 | 6 | employee performed activity stripping cathode pulling cathode sheet hand hit side another cathode causing blunt cut nd finger left hand |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | 7 | 9 | assistant cleaned floor module e central camp slipped back immediately grabbed laundry table avoid falling floor suffering described injury |
418 rows × 13 columns
# Rename Cleaned_Description to Description
ISH_NLP_preprocess = ISH_NLP_preprocess.rename(columns={'Cleaned_Description': 'Description'})
ISH_NLP_preprocess
| Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | Month | Day | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | 1 | 1 | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | 1 | 2 | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | 1 | 6 | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | 1 | 8 | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | 1 | 10 | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | 7 | 4 | approximately approximately lifting kelly hq towards pulley frame align assistant marco later one struck hand frame generating injury |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | 7 | 4 | collaborator moved infrastructure office julio toilet pin right shoe hooked bra left shoe causing take step fall untimely causing injury described |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | 7 | 5 | environmental monitoring activity area employee surprised swarming swarm weevil exit place endured suffering two sting one face middle finger left hand |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | 7 | 6 | employee performed activity stripping cathode pulling cathode sheet hand hit side another cathode causing blunt cut nd finger left hand |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | 7 | 9 | assistant cleaned floor module e central camp slipped back immediately grabbed laundry table avoid falling floor suffering described injury |
418 rows × 13 columns
# Save the preprocessed data
ISH_NLP_preprocess.to_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/ISH_NLP_preprocess.csv', index=False)
from collections import Counter
# Load the preprocessed data
ISH_NLP_preprocess = pd.read_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/ISH_NLP_preprocess.csv')
# Combine all descriptions into a single string
all_text = ' '.join(ISH_NLP_preprocess['Description'].astype(str))
# Tokenize the combined text
tokens = word_tokenize(all_text)
# Calculate token distribution
token_counts = Counter(tokens)
# Create a dataframe from the most common words
top_words_df = pd.DataFrame(token_counts.most_common(30), columns=['Word', 'Count'])
# Display the dataframe
top_words_df
| Word | Count | |
|---|---|---|
| 0 | hand | 177 |
| 1 | employee | 172 |
| 2 | causing | 166 |
| 3 | left | 155 |
| 4 | right | 154 |
| 5 | operator | 132 |
| 6 | activity | 117 |
| 7 | time | 112 |
| 8 | injury | 110 |
| 9 | moment | 101 |
| 10 | worker | 84 |
| 11 | collaborator | 81 |
| 12 | area | 80 |
| 13 | work | 79 |
| 14 | equipment | 77 |
| 15 | finger | 76 |
| 16 | assistant | 75 |
| 17 | accident | 73 |
| 18 | pipe | 71 |
| 19 | level | 70 |
| 20 | hit | 70 |
| 21 | one | 66 |
| 22 | floor | 65 |
| 23 | support | 62 |
| 24 | mesh | 59 |
| 25 | rock | 56 |
| 26 | fall | 55 |
| 27 | safety | 53 |
| 28 | mr | 53 |
| 29 | cm | 53 |
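Beyond the top-30 list, the same `Counter` supports quick corpus statistics such as vocabulary size, total token count, and hapaxes (words occurring once). A toy sketch with made-up counts (in the notebook, `token_counts` is built from the cleaned descriptions):

```python
from collections import Counter

# Toy counts standing in for the notebook's `token_counts`.
token_counts = Counter({"hand": 177, "employee": 172, "causing": 166, "zaf": 1})

vocab_size = len(token_counts)                            # distinct words
total_tokens = sum(token_counts.values())                 # corpus size
hapaxes = [w for w, c in token_counts.items() if c == 1]  # words seen once

print(vocab_size, total_tokens, hapaxes)  # -> 4 516 ['zaf']
```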
# @title Wordcloud for N-Grams
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Combine all descriptions into a single string
all_text = ' '.join(ISH_NLP_preprocess['Description'].astype(str))
# Generate word cloud for unigrams
wordcloud_unigrams = WordCloud(width=800, height=400, background_color='white').generate(all_text)
# Generate word cloud for bigrams
bigrams = nltk.bigrams(word_tokenize(all_text))
bigram_text = ' '.join(['_'.join(bigram) for bigram in bigrams])
wordcloud_bigrams = WordCloud(width=800, height=400, background_color='white').generate(bigram_text)
# Generate word cloud for trigrams
trigrams = nltk.trigrams(word_tokenize(all_text))
trigram_text = ' '.join(['_'.join(trigram) for trigram in trigrams])
wordcloud_trigrams = WordCloud(width=800, height=400, background_color='white').generate(trigram_text)
# Display the word clouds
plt.figure(figsize=(45, 15))
plt.subplot(1, 3, 1)
plt.imshow(wordcloud_unigrams, interpolation='bilinear')
plt.title('Unigrams')
plt.axis('off')
plt.subplot(1, 3, 2)
plt.imshow(wordcloud_bigrams, interpolation='bilinear')
plt.title('Bigrams')
plt.axis('off')
plt.subplot(1, 3, 3)
plt.imshow(wordcloud_trigrams, interpolation='bilinear')
plt.title('Trigrams')
plt.axis('off')
plt.show()
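The underscore-joining used for the bigram and trigram clouds can be sketched on a toy sentence; plain `zip` stands in for `nltk.bigrams` here so the sketch needs no NLTK data download:

```python
# Toy sentence (hypothetical); zip(tokens, tokens[1:]) yields the same
# consecutive pairs as nltk.bigrams
tokens = "employee hit left hand".split()
bigrams = ['_'.join(pair) for pair in zip(tokens, tokens[1:])]
# bigrams == ['employee_hit', 'hit_left', 'left_hand']
```

Joining each n-gram with underscores keeps it as a single token, so WordCloud renders "left_hand" as one word instead of splitting it back into two.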
ISH_NLP_preprocess
| | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | Month | Day | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | 1 | 1 | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | 1 | 2 | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | 1 | 6 | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | 1 | 8 | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | 1 | 10 | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | 7 | 4 | approximately approximately lifting kelly hq towards pulley frame align assistant marco later one struck hand frame generating injury |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | 7 | 4 | collaborator moved infrastructure office julio toilet pin right shoe hooked bra left shoe causing take step fall untimely causing injury described |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | 7 | 5 | environmental monitoring activity area employee surprised swarming swarm weevil exit place endured suffering two sting one face middle finger left hand |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | 7 | 6 | employee performed activity stripping cathode pulling cathode sheet hand hit side another cathode causing blunt cut nd finger left hand |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | 7 | 9 | assistant cleaned floor module e central camp slipped back immediately grabbed laundry table avoid falling floor suffering described injury |
418 rows × 13 columns
import pandas as pd
import re
# Function to preprocess and tokenize descriptions
def preprocess_and_tokenize(description):
# Convert to lowercase
description = description.lower()
# Remove punctuation and non-alphabetic characters
description = re.sub(r'[^a-z\s]', '', description)
# Tokenize (split by whitespace)
words = description.split()
return words
# Apply the preprocessing function
ISH_NLP_preprocess['tokenized_words'] = ISH_NLP_preprocess['Description'].apply(preprocess_and_tokenize)
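A quick sanity check of the tokenizer on a hypothetical raw snippet shows why tokens like "nd" appear in the corpus: digits are stripped from ordinals such as "2nd" before splitting.

```python
import re

def preprocess_and_tokenize(description):
    # Same steps as above: lowercase, drop non-alphabetic characters, split
    description = description.lower()
    description = re.sub(r'[^a-z\s]', '', description)
    return description.split()

# Hypothetical raw description snippet
tokens = preprocess_and_tokenize("Operator hit 2nd finger, causing injury.")
# tokens == ['operator', 'hit', 'nd', 'finger', 'causing', 'injury']
```

This accounts for artifacts like "nd finger" visible in row 416 of the table below.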
ISH_NLP_preprocess
| | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | Month | Day | Description | tokenized_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | 1 | 1 | removing drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal seeing mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo | [removing, drill, rod, jumbo, maintenance, supervisor, proceeds, loosen, support, intermediate, centralizer, facilitate, removal, seeing, mechanic, support, one, end, drill, equipment, pull, hand, bar, accelerate, removal, moment, bar, slide, point, support, tightens, finger, mechanic, drilling, bar, beam, jumbo] |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | 1 | 2 | activation sodium sulphide pump piping uncoupled sulfide solution designed area reach maid immediately made use emergency shower directed ambulatory doctor later hospital note sulphide solution gram liter | [activation, sodium, sulphide, pump, piping, uncoupled, sulfide, solution, designed, area, reach, maid, immediately, made, use, emergency, shower, directed, ambulatory, doctor, later, hospital, note, sulphide, solution, gram, liter] |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | 1 | 6 | substation milpo located level collaborator excavation work pick hand tool hitting rock flat part beak bounce hitting steel tip safety shoe metatarsal area left foot collaborator causing injury | [substation, milpo, located, level, collaborator, excavation, work, pick, hand, tool, hitting, rock, flat, part, beak, bounce, hitting, steel, tip, safety, shoe, metatarsal, area, left, foot, collaborator, causing, injury] |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | 1 | 8 | approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identified hexagonal head worn proceeding mr cristbal auxiliary assistant climb platform exert pressure hand dado key prevent coming bolt moment two collaborator rotate lever anticlockwise direction leaving key bolt hitting palm left hand causing injury | [approximately, nv, cx, ob, personnel, begin, task, unlocking, soquet, bolt, bhb, machine, penultimate, bolt, identified, hexagonal, head, worn, proceeding, mr, cristbal, auxiliary, assistant, climb, platform, exert, pressure, hand, dado, key, prevent, coming, bolt, moment, two, collaborator, rotate, lever, anticlockwise, direction, leaving, key, bolt, hitting, palm, left, hand, causing, injury] |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | 1 | 10 | approximately circumstance mechanic anthony group leader eduardo eric fernndezinjuredthe three company impromec performed removal pulley motor pump zaf marcy cm length cm weight kg locked proceed heating pulley loosen come fall distance meter high hit instep right foot worker causing injury described | [approximately, circumstance, mechanic, anthony, group, leader, eduardo, eric, fernndezinjuredthe, three, company, impromec, performed, removal, pulley, motor, pump, zaf, marcy, cm, length, cm, weight, kg, locked, proceed, heating, pulley, loosen, come, fall, distance, meter, high, hit, instep, right, foot, worker, causing, injury, described] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | 7 | 4 | approximately approximately lifting kelly hq towards pulley frame align assistant marco later one struck hand frame generating injury | [approximately, approximately, lifting, kelly, hq, towards, pulley, frame, align, assistant, marco, later, one, struck, hand, frame, generating, injury] |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | 7 | 4 | collaborator moved infrastructure office julio toilet pin right shoe hooked bra left shoe causing take step fall untimely causing injury described | [collaborator, moved, infrastructure, office, julio, toilet, pin, right, shoe, hooked, bra, left, shoe, causing, take, step, fall, untimely, causing, injury, described] |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | 7 | 5 | environmental monitoring activity area employee surprised swarming swarm weevil exit place endured suffering two sting one face middle finger left hand | [environmental, monitoring, activity, area, employee, surprised, swarming, swarm, weevil, exit, place, endured, suffering, two, sting, one, face, middle, finger, left, hand] |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | 7 | 6 | employee performed activity stripping cathode pulling cathode sheet hand hit side another cathode causing blunt cut nd finger left hand | [employee, performed, activity, stripping, cathode, pulling, cathode, sheet, hand, hit, side, another, cathode, causing, blunt, cut, nd, finger, left, hand] |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | 7 | 9 | assistant cleaned floor module e central camp slipped back immediately grabbed laundry table avoid falling floor suffering described injury | [assistant, cleaned, floor, module, e, central, camp, slipped, back, immediately, grabbed, laundry, table, avoid, falling, floor, suffering, described, injury] |
418 rows × 14 columns
ISH_NLP_preprocess.shape
(418, 14)
ISH_NLP_preprocess.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Country                   418 non-null    object
 1   City                      418 non-null    object
 2   Industry Sector           418 non-null    object
 3   Accident Level            418 non-null    object
 4   Potential Accident Level  418 non-null    object
 5   Gender                    418 non-null    object
 6   Employee Type             418 non-null    object
 7   Critical Risk             418 non-null    object
 8   DayOfWeek                 418 non-null    int64
 9   Year                      418 non-null    int64
 10  Month                     418 non-null    int64
 11  Day                       418 non-null    int64
 12  Description               418 non-null    object
 13  tokenized_words           418 non-null    object
dtypes: int64(4), object(10)
memory usage: 45.8+ KB
ISH_NLP_preprocess1 = ISH_NLP_preprocess.copy()
Generating Word Embeddings over the 'Description' column using GloVe, TF-IDF and Word2Vec
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
def generate_embedding_dataframes(df):
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
# 1. GloVe Embeddings
def load_glove_model(glove_file):
embedding_dict = {}
with open(glove_file, 'r', encoding="utf8") as f:
for line in f:
values = line.split()
word = values[0]
vector = np.asarray(values[1:], "float32")
embedding_dict[word] = vector
return embedding_dict
def get_average_glove_embeddings(tokenized_words, embedding_dict, embedding_dim=300):
embeddings = [embedding_dict.get(word, np.zeros(embedding_dim)) for word in tokenized_words]
return np.mean(embeddings, axis=0) if embeddings else np.zeros(embedding_dim)
# Load GloVe model and generate GloVe embeddings
# Path under the Drive mount point used earlier (/content/gdrive)
glove_file = '/content/gdrive/My Drive/Capstone_Group10_NLP1/glove.6B/glove.6B.300d.txt'
glove_embeddings = load_glove_model(glove_file)
glove_embeddings_series = df1['tokenized_words'].apply(lambda words: get_average_glove_embeddings(words, glove_embeddings))
ISH_NLP_Glove_df = pd.concat([df1.drop(columns=['tokenized_words']), pd.DataFrame(glove_embeddings_series.tolist(), columns=[f'GloVe_{i}' for i in range(300)])], axis=1)
# 2. TF-IDF Features
# Descriptions are already tokenized lists, so pass an identity tokenizer
# and disable lowercasing and the default token pattern
tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False, token_pattern=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(df2['tokenized_words'])
# Create a DataFrame with TF-IDF features
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
ISH_NLP_TFIDF_df = pd.concat([df2.drop(columns=['tokenized_words']), tfidf_df], axis=1)
# 3. Word2Vec Embeddings
word2vec_model = Word2Vec(sentences=df3['tokenized_words'], vector_size=300, window=5, min_count=1, workers=4)
def get_average_word2vec_embeddings(tokenized_words, model, embedding_dim=300):
embeddings = [model.wv[word] for word in tokenized_words if word in model.wv]
return np.mean(embeddings, axis=0) if embeddings else np.zeros(embedding_dim)
word2vec_embeddings_series = df3['tokenized_words'].apply(lambda words: get_average_word2vec_embeddings(words, word2vec_model))
ISH_NLP_Word2Vec_df = pd.concat([df3.drop(columns=['tokenized_words']), pd.DataFrame(word2vec_embeddings_series.tolist(), columns=[f'Word2Vec_{i}' for i in range(300)])], axis=1)
return ISH_NLP_Glove_df, ISH_NLP_TFIDF_df, ISH_NLP_Word2Vec_df
# Use the function to generate the DataFrames
ISH_NLP_Glove_df, ISH_NLP_TFIDF_df, ISH_NLP_Word2Vec_df = generate_embedding_dataframes(ISH_NLP_preprocess1)
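The averaging inside `get_average_glove_embeddings` can be verified on a toy lookup table (hypothetical 4-dimensional vectors in place of the 300-dimensional GloVe file); note that out-of-vocabulary words contribute zero vectors, which dilutes the mean:

```python
import numpy as np

# Toy GloVe-style table (hypothetical values)
toy_embeddings = {
    "hand": np.array([1.0, 0.0, 0.0, 0.0]),
    "injury": np.array([0.0, 1.0, 0.0, 0.0]),
}

def average_embedding(tokens, table, dim=4):
    # OOV tokens fall back to zeros, mirroring the function above
    vecs = [table.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

vec = average_embedding(["hand", "injury", "oov"], toy_embeddings)
# vec == [1/3, 1/3, 0, 0]
```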
ISH_NLP_Glove_df
| | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | ... | GloVe_290 | GloVe_291 | GloVe_292 | GloVe_293 | GloVe_294 | GloVe_295 | GloVe_296 | GloVe_297 | GloVe_298 | GloVe_299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | ... | -0.034536 | -0.110637 | -0.085788 | -0.031955 | 0.008084 | 0.205297 | -0.001389 | -0.296468 | -0.061921 | -0.003529 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | ... | -0.412660 | -0.135541 | 0.049905 | 0.032907 | 0.103431 | -0.155970 | 0.078383 | -0.218822 | -0.099618 | -0.053435 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | ... | 0.005927 | -0.135486 | -0.016369 | 0.125184 | 0.149826 | 0.194006 | 0.028868 | -0.159949 | 0.032494 | -0.110724 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | ... | -0.037377 | -0.070661 | 0.078244 | -0.019498 | -0.035796 | 0.246286 | -0.105964 | -0.115616 | -0.050545 | -0.049797 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | ... | 0.103048 | -0.080292 | 0.028120 | -0.075642 | 0.116875 | 0.247585 | -0.008106 | -0.106944 | -0.074254 | -0.087914 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | ... | -0.048683 | -0.039020 | -0.071929 | -0.091603 | 0.107000 | 0.385754 | -0.140584 | -0.078597 | 0.143009 | -0.130202 |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | ... | 0.049501 | -0.147315 | 0.041269 | 0.039820 | 0.083148 | 0.199192 | -0.086235 | -0.224753 | 0.005231 | -0.024155 |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | ... | 0.058225 | -0.122102 | -0.121571 | 0.074627 | 0.131929 | 0.145566 | 0.031812 | 0.011314 | -0.088791 | -0.089753 |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | ... | -0.095062 | -0.107262 | 0.079336 | 0.124554 | 0.068740 | 0.040127 | 0.048653 | -0.123861 | 0.090110 | -0.117909 |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | ... | 0.028054 | 0.010017 | -0.083869 | -0.013579 | 0.174762 | 0.119727 | 0.049611 | -0.257038 | -0.052309 | -0.065951 |
418 rows × 313 columns
ISH_NLP_TFIDF_df
| | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | ... | yolk | young | z | zaf | zamac | zero | zinc | zinco | zn | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | ... | 0.0 | 0.0 | 0.0 | 0.200191 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
418 rows × 2827 columns
ISH_NLP_Word2Vec_df
| | Country | City | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee Type | Critical Risk | DayOfWeek | Year | ... | Word2Vec_290 | Word2Vec_291 | Word2Vec_292 | Word2Vec_293 | Word2Vec_294 | Word2Vec_295 | Word2Vec_296 | Word2Vec_297 | Word2Vec_298 | Word2Vec_299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | City_01 | Mining | I | IV | Male | Third Party | Pressed | 4 | 2016 | ... | -0.000184 | 0.008809 | 0.006719 | -0.000653 | 0.008435 | 0.008823 | -0.002145 | -0.005005 | 0.004047 | -0.001308 |
| 1 | Country_02 | City_02 | Mining | I | IV | Male | Employee | Pressurized Systems | 5 | 2016 | ... | -0.000224 | 0.003050 | 0.002841 | -0.000295 | 0.002847 | 0.003351 | -0.000023 | -0.001519 | 0.001563 | 0.000062 |
| 2 | Country_01 | City_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | 2 | 2016 | ... | 0.000299 | 0.009057 | 0.007054 | -0.000644 | 0.007221 | 0.010194 | -0.001305 | -0.004769 | 0.003950 | -0.001413 |
| 3 | Country_01 | City_04 | Mining | I | I | Male | Third Party | Others | 4 | 2016 | ... | -0.000157 | 0.006961 | 0.005656 | -0.000583 | 0.006393 | 0.007921 | -0.000922 | -0.004311 | 0.003201 | -0.001234 |
| 4 | Country_01 | City_04 | Mining | IV | IV | Male | Third Party | Others | 6 | 2016 | ... | -0.000690 | 0.007071 | 0.005584 | -0.000388 | 0.005998 | 0.007630 | -0.001644 | -0.003608 | 0.003611 | -0.000547 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | Country_01 | City_04 | Mining | I | III | Male | Third Party | Others | 1 | 2017 | ... | -0.000250 | 0.009238 | 0.005928 | -0.000045 | 0.007040 | 0.008480 | -0.001309 | -0.004440 | 0.003802 | -0.001885 |
| 414 | Country_01 | City_03 | Mining | I | II | Female | Employee | Others | 1 | 2017 | ... | 0.000451 | 0.007068 | 0.005498 | -0.001052 | 0.006713 | 0.007056 | -0.000784 | -0.003366 | 0.003225 | -0.000547 |
| 415 | Country_02 | City_09 | Metals | I | II | Male | Employee | Venomous Animals | 2 | 2017 | ... | 0.000086 | 0.008377 | 0.006045 | -0.000146 | 0.008351 | 0.009401 | -0.001291 | -0.005152 | 0.003886 | -0.000631 |
| 416 | Country_02 | City_05 | Metals | I | II | Male | Employee | Cut | 3 | 2017 | ... | -0.001133 | 0.011386 | 0.007921 | -0.000351 | 0.011003 | 0.013270 | -0.001504 | -0.005928 | 0.005737 | -0.001309 |
| 417 | Country_01 | City_04 | Mining | I | II | Female | Third Party | Fall prevention (same level) | 6 | 2017 | ... | 0.000254 | 0.006694 | 0.004939 | -0.000633 | 0.005915 | 0.006429 | -0.000586 | -0.003666 | 0.003575 | -0.000935 |
418 rows × 313 columns
ISH_NLP_preprocess1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Country                   418 non-null    object
 1   City                      418 non-null    object
 2   Industry Sector           418 non-null    object
 3   Accident Level            418 non-null    object
 4   Potential Accident Level  418 non-null    object
 5   Gender                    418 non-null    object
 6   Employee Type             418 non-null    object
 7   Critical Risk             418 non-null    object
 8   DayOfWeek                 418 non-null    int64
 9   Year                      418 non-null    int64
 10  Month                     418 non-null    int64
 11  Day                       418 non-null    int64
 12  Description               418 non-null    object
 13  tokenized_words           418 non-null    object
dtypes: int64(4), object(10)
memory usage: 45.8+ KB
# Print shapes to confirm
print(ISH_NLP_Glove_df.shape)
print(ISH_NLP_TFIDF_df.shape)
print(ISH_NLP_Word2Vec_df.shape)
(418, 313)
(418, 2827)
(418, 313)
for dtype in ISH_NLP_Glove_df.dtypes.unique():
print(f"Columns of type {dtype}:")
print(ISH_NLP_Glove_df.select_dtypes(include=[dtype]).columns.tolist())
print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']

Columns of type int64:
['DayOfWeek', 'Year', 'Month', 'Day']

Columns of type float64:
['GloVe_0', 'GloVe_1', ..., 'GloVe_299']  (300 GloVe embedding columns; full list omitted)
for dtype in ISH_NLP_TFIDF_df.dtypes.unique():
print(f"Columns of type {dtype}:")
print(ISH_NLP_TFIDF_df.select_dtypes(include=[dtype]).columns.tolist())
print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']

Columns of type int64:
['DayOfWeek', 'Year', 'Month', 'Day']

Columns of type float64:
['abb', 'abdomen', ..., 'zone']  (2,814 TF-IDF token columns; full list truncated)
'eyelash', 'eyelet', 'eyelid', 'eyewash', 'f', 'fabio', 'fabric', 'face', 'facial', 'facila', 'facilitate', 'facility', 'fact', 'factory', 'failed', 'failure', 'faintness', 'fall', 'fallen', 'falling', 'false', 'fan', 'fanel', 'faneles', 'farm', 'fastening', 'faucet', 'favor', 'fbio', 'feast', 'fectuaban', 'feed', 'feeder', 'feeding', 'feel', 'feeling', 'felipe', 'felix', 'fell', 'felt', 'fence', 'fenced', 'fender', 'fernando', 'fernndezinjuredthe', 'ferranta', 'fi', 'fiberglass', 'field', 'fifth', 'figure', 'fill', 'filled', 'filling', 'filter', 'filtration', 'final', 'finally', 'find', 'finding', 'fine', 'finger', 'finish', 'finished', 'finishing', 'fire', 'fired', 'firmly', 'first', 'fish', 'fisherman', 'fissure', 'fit', 'fitting', 'five', 'fix', 'fixed', 'fixing', 'flammable', 'flange', 'flash', 'flat', 'flex', 'flexible', 'flexing', 'floor', 'flotation', 'flow', 'flyght', 'fm', 'foam', 'fogging', 'folder', 'foliage', 'followed', 'following', 'food', 'foot', 'footdeep', 'footwear', 'fop', 'force', 'forearm', 'forehead', 'foreman', 'forest', 'forklift', 'form', 'formation', 'formed', 'former', 'formerly', 'forward', 'found', 'foundry', 'four', 'fourth', 'fracture', 'fragment', 'fragmented', 'fragmentos', 'frame', 'francisco', 'frank', 'freddy', 'free', 'freed', 'friction', 'fright', 'frightened', 'front', 'frontal', 'frontally', 'fruit', 'ft', 'fuel', 'fulcrum', 'full', 'fully', 'functioning', 'funnel', 'furnace', 'fuse', 'future', 'fz', 'g', 'gable', 'gallery', 'gallon', 'gap', 'garit', 'garrote', 'gas', 'gate', 'gauge', 'gave', 'gaze', 'gear', 'gearbox', 'geho', 'general', 'generate', 'generated', 'generates', 'generating', 'geological', 'geologist', 'geologo', 'geology', 'geomembrane', 'georli', 'geosol', 'get', 'getting', 'gift', 'gilton', 'gilvnio', 'girdle', 'give', 'giving', 'glass', 'glove', 'go', 'goat', 'goggles', 'going', 'good', 'got', 'gps', 'gr', 'grab', 'grabbed', 'gram', 'granja', 'grate', 'grating', 'gravel', 'grazed', 'grazing', 'greater', 
'grid', 'griff', 'grille', 'grinder', 'grinding', 'ground', 'group', 'grp', 'grs', 'gts', 'guard', 'guide', 'guillotine', 'gun', 'gutter', 'h', 'habilitation', 'half', 'hammer', 'hand', 'handle', 'handling', 'handrail', 'hanging', 'happened', 'happens', 'hardened', 'harness', 'hastial', 'hat', 'hatch', 'hattype', 'hauling', 'hdp', 'hdpe', 'head', 'heading', 'headlight', 'health', 'heard', 'hears', 'heat', 'heated', 'heating', 'heavy', 'heel', 'height', 'held', 'helical', 'helmet', 'help', 'helper', 'hematoma', 'hemiface', 'hexagonal', 'hiab', 'hidalgo', 'high', 'highway', 'hill', 'hinge', 'hip', 'hissing', 'hit', 'hitchhiking', 'hitting', 'hm', 'hoe', 'hoist', 'hoisting', 'hoistings', 'hold', 'holder', 'holding', 'hole', 'hood', 'hook', 'hooked', 'hopper', 'horizontal', 'horizontally', 'horse', 'hose', 'hospital', 'hot', 'hour', 'house', 'housing', 'hq', 'hr', 'humped', 'hurried', 'hw', 'hycrontype', 'hydraulic', 'hydrojet', 'hydroxide', 'hyt', 'ice', 'identified', 'identifies', 'identify', 'iglu', 'ignited', 'igniting', 'igor', 'ii', 'iii', 'illness', 'imbalance', 'immediate', 'immediately', 'impact', 'impacted', 'impacting', 'importance', 'impregnated', 'imprisoned', 'imprisoning', 'imprisonment', 'imprisons', 'impromec', 'improve', 'incentration', 'inch', 'inchancable', 'inchancables', 'inchancanbles', 'incident', 'incimet', 'incimmet', 'inclination', 'inclined', 'including', 'increase', 'index', 'indexed', 'indicate', 'indicated', 'indicates', 'industrial', 'inefficacy', 'inertia', 'inferior', 'informed', 'informs', 'infrastructure', 'ingot', 'initial', 'initiate', 'initiated', 'initiating', 'injection', 'injured', 'injures', 'injuring', 'injury', 'inlet', 'inner', 'insect', 'insertion', 'inside', 'inspect', 'inspecting', 'inspection', 'install', 'installation', 'installed', 'installing', 'instant', 'instep', 'instructed', 'insulation', 'intense', 'intention', 'interior', 'interlaced', 'intermediate', 'internal', 'intersection', 'inthinc', 'introduce', 
'introduced', 'introduces', 'invaded', 'investigation', 'involuntarily', 'involved', 'inward', 'ip', 'iron', 'ironing', 'irritation', 'iscmg', 'isidro', 'isolated', 'ith', 'iv', 'ja', 'jaba', 'jack', 'jacket', 'jackleg', 'jaw', 'jehovah', 'jehovnio', 'jesus', 'jet', 'jetanol', 'jhon', 'jhonatan', 'jhony', 'jib', 'jka', 'job', 'joint', 'jos', 'jose', 'josimar', 'juan', 'julio', 'july', 'jumbo', 'jump', 'jumped', 'juna', 'junior', 'juveni', 'kelly', 'kept', 'kevin', 'key', 'keypad', 'kg', 'kicked', 'killer', 'kiln', 'kitchen', 'km', 'knee', 'kneeling', 'knife', 'know', 'known', 'knuckle', 'kv', 'l', 'la', 'label', 'labeling', 'labor', 'laboratory', 'laceration', 'lack', 'ladder', 'laden', 'lady', 'lajes', 'laminator', 'lamp', 'lance', 'lane', 'laquia', 'large', 'lash', 'last', 'later', 'lateral', 'laterally', 'latter', 'launch', 'launched', 'launcher', 'launching', 'laundry', 'lavras', 'lay', 'lb', 'leaching', 'lead', 'leader', 'leaf', 'leak', 'leakage', 'lean', 'leandro', 'leaning', 'leather', 'leathertype', 'leave', 'leaving', 'lectrowelded', 'led', 'left', 'leg', 'legging', 'lemon', 'length', 'lens', 'lesion', 'leucenas', 'level', 'lever', 'lhd', 'liana', 'license', 'lid', 'lifeline', 'lift', 'lifted', 'lifting', 'light', 'lighthouse', 'like', 'liliana', 'lima', 'limb', 'lime', 'line', 'lineman', 'lining', 'link', 'lip', 'liquid', 'list', 'lit', 'liter', 'litorina', 'litter', 'little', 'lloclla', 'lm', 'load', 'loaded', 'loader', 'loading', 'local', 'localized', 'locate', 'located', 'location', 'lock', 'locked', 'locker', 'locking', 'locomotive', 'lodged', 'long', 'longer', 'look', 'looked', 'looking', 'lookout', 'loose', 'loosen', 'loosened', 'loosening', 'loosens', 'lose', 'loses', 'losing', 'lost', 'loud', 'low', 'lower', 'lowered', 'lowvoltage', 'lt', 'ltda', 'lubricant', 'lubricating', 'lubrication', 'lubricator', 'lucas', 'luciano', 'luis', 'luiz', 'lumbar', 'luna', 'lunch', 'lung', 'luxofractures', 'lxbb', 'lxpb', 'lying', 'lyner', 'lzaro', 'macedonio', 
'machete', 'machine', 'machinery', 'made', 'maestranza', 'mag', 'magazine', 'magnetometer', 'magnetometric', 'maid', 'main', 'maintaining', 'maintenance', 'make', 'making', 'mallet', 'man', 'managed', 'management', 'manages', 'managing', 'manco', 'manetometer', 'maneuver', 'mangote', 'manhole', 'manifestation', 'manifested', 'manipulate', 'manipulated', 'manipulates', 'manipulating', 'manipulation', 'manipulator', 'manitou', 'manoel', 'manual', 'manually', 'manuel', 'maperu', 'mapping', 'marble', 'marcelo', 'marco', 'marcos', 'marcy', 'maribondos', 'marimbondo', 'marimbondos', 'mario', 'marked', 'marking', 'martinpole', 'mask', 'maslucan', 'mason', 'master', 'mat', 'mata', 'material', 'maximum', 'mc', 'mceisa', 'mean', 'measurement', 'measuring', 'mechanic', 'mechanical', 'mechanized', 'medical', 'medicated', 'medicine', 'melt', 'melting', 'member', 'mesh', 'messrs', 'metal', 'metallic', 'metatarsal', 'meter', 'middle', 'miguel', 'mild', 'mill', 'milling', 'milpo', 'milton', 'mina', 'mincing', 'mine', 'mineral', 'mini', 'mining', 'minor', 'minute', 'misalignment', 'missing', 'mix', 'mixed', 'mixer', 'mixkret', 'mixture', 'ml', 'mobile', 'module', 'mollares', 'mollaress', 'moment', 'mona', 'monitoring', 'monkey', 'month', 'moon', 'mooring', 'morais', 'mortar', 'moth', 'motion', 'motor', 'motorist', 'mount', 'mounted', 'mouth', 'move', 'moved', 'movement', 'moving', 'mr', 'mrcio', 'mrio', 'mt', 'mud', 'mudswathed', 'municipal', 'murilo', 'muscle', 'mv', 'mx', 'mxm', 'mxmxm', 'n', 'nail', 'nailed', 'nailing', 'nascimento', 'natclar', 'nd', 'near', 'nearby', 'necessary', 'neck', 'need', 'needed', 'needle', 'negative', 'neglected', 'neutral', 'new', 'next', 'night', 'nilton', 'nipple', 'nitric', 'noise', 'none', 'nonsustained', 'normal', 'normally', 'north', 'nose', 'note', 'notebook', 'noted', 'notice', 'noticed', 'noticing', 'novo', 'nozzle', 'nq', 'nro', 'nut', 'nv', 'nylon', 'ob', 'oba', 'obb', 'object', 'observe', 'observed', 'observes', 'observing', 'obstruct', 
'obstructed', 'obstructing', 'obstruction', 'occupant', 'occurred', 'occurring', 'occurs', 'office', 'official', 'oil', 'old', 'ompressor', 'one', 'onto', 'op', 'open', 'opened', 'opening', 'operate', 'operated', 'operates', 'operating', 'operation', 'operational', 'operator', 'opposite', 'orange', 'order', 'ordinary', 'ore', 'originating', 'orlando', 'oscillation', 'osorio', 'outcrop', 'outlet', 'outpatient', 'outside', 'oven', 'overall', 'overcoming', 'overexertion', 'overflow', 'overhanging', 'overhead', 'overheating', 'overlap', 'overpressure', 'overturning', 'oxicorte', 'oxide', 'oxyfuel', 'pablo', 'pack', 'package', 'packaging', 'pad', 'page', 'paid', 'pain', 'paint', 'painting', 'palm', 'panel', 'pant', 'paracatu', 'paralysis', 'paralyze', 'paralyzed', 'paralyzes', 'park', 'parked', 'parking', 'part', 'partially', 'participating', 'particle', 'partner', 'pas', 'pasco', 'pass', 'passage', 'passed', 'passing', 'paste', 'pasture', 'path', 'patrol', 'patronal', 'paulo', 'paused', 'pb', 'pead', 'pear', 'pedal', 'pedestal', 'pedro', 'peeling', 'pen', 'pendulum', 'pentacord', 'penultimate', 'people', 'per', 'perceived', 'perceives', 'percussion', 'perforation', 'perform', 'performed', 'performer', 'performing', 'performs', 'period', 'peristaltic', 'person', 'personal', 'personnel', 'phalanx', 'phase', 'photo', 'photograph', 'physician', 'pick', 'pickaxe', 'picking', 'pickup', 'piece', 'pierce', 'pierced', 'piercing', 'pig', 'pillar', 'pilot', 'pin', 'pink', 'pinking', 'pinning', 'pipe', 'pipeline', 'pipette', 'piping', 'pique', 'piquero', 'piston', 'pit', 'pivot', 'place', 'placed', 'placement', 'placing', 'planamieto', 'planning', 'plant', 'plastic', 'plate', 'platform', 'play', 'plug', 'pm', 'pneumatic', 'pocket', 'point', 'pointed', 'pole', 'polling', 'polyethylene', 'polymer', 'polyontusions', 'polypropylene', 'polyurethane', 'pom', 'poncho', 'porangatu', 'portable', 'portion', 'porvenir', 'position', 'positioned', 'positioning', 'positive', 'possible', 
'possibly', 'post', 'pot', 'potion', 'pound', 'pouring', 'povoado', 'powder', 'power', 'ppe', 'ppes', 'pre', 'preparation', 'prepared', 'prepares', 'preparing', 'prescribing', 'presence', 'present', 'presented', 'presenting', 'press', 'pressed', 'pressing', 'pressure', 'preuse', 'prevent', 'prevented', 'preventive', 'previous', 'previously', 'prick', 'pricked', 'prils', 'primary', 'probe', 'problem', 'procedure', 'proceed', 'proceeded', 'proceeding', 'proceeds', 'process', 'produce', 'produced', 'producing', 'product', 'production', 'profile', 'progress', 'progressive', 'proingcom', 'project', 'projected', 'projecting', 'projection', 'promptly', 'prong', 'propeller', 'properly', 'propicindose', 'prospector', 'protection', 'protective', 'protector', 'protruded', 'protruding', 'provoking', 'proximal', 'psi', 'public', 'puddle', 'pull', 'pulled', 'pulley', 'pulling', 'pulp', 'pulpomatic', 'pump', 'pumping', 'purification', 'push', 'pushed', 'pushing', 'put', 'putting', 'putty', 'pvc', 'pvctype', 'pyrotechnic', 'queneche', 'quickly', 'quinoa', 'quirodactilo', 'quirodactyl', 'r', 'rack', 'radial', 'radiator', 'radio', 'radius', 'rafael', 'rag', 'rail', 'railing', 'railway', 'raise', 'raised', 'raising', 'rake', 'ramp', 'rampa', 'ran', 'rapid', 'raspndose', 'raul', 'ravine', 'rb', 'rd', 'reach', 'reached', 'reaching', 'reacting', 'reaction', 'reactive', 'readjusted', 'realize', 'realized', 'realizes', 'realizing', 'rear', 'reason', 'rebound', 'receive', 'received', 'receiving', 'recently', 'reception', 'reciprocating', 'reconnaissance', 'recovery', 'redness', 'reduce', 'reduced', 'reducer', 'reduction', 'reel', 'reevaluation', 'reference', 'referred', 'reflux', 'refractory', 'refrigerant', 'refuge', 'refurbishment', 'region', 'registered', 'reinforce', 'reinstallation', 'release', 'released', 'releasing', 'remained', 'remaining', 'remains', 'remedy', 'removal', 'remove', 'removed', 'removing', 'renato', 'repair', 'replacing', 'report', 'reported', 'reporting', 
'reposition', 'representing', 'repulping', 'request', 'required', 'requires', 'resane', 'rescued', 'research', 'reserve', 'reshaping', 'residence', 'resident', 'residual', 'residue', 'resin', 'resistance', 'respective', 'respirator', 'respond', 'response', 'responsible', 'rest', 'restart', 'restarting', 'rested', 'resting', 'restricts', 'result', 'resulted', 'resulting', 'retire', 'retired', 'retiring', 'retraction', 'retracts', 'retreat', 'return', 'returned', 'returning', 'revegetation', 'reverse', 'review', 'rhainer', 'rhyming', 'ribbon', 'rice', 'riding', 'rig', 'rigger', 'right', 'rim', 'ring', 'ripped', 'ripper', 'rise', 'risk', 'rivet', 'rlc', 'road', 'robot', 'robson', 'rock', 'rocker', 'rod', 'roger', 'rolando', 'roll', 'rolled', 'roller', 'rolling', 'rollover', 'romn', 'ronald', 'roof', 'room', 'rope', 'rops', 'rotary', 'rotate', 'rotated', 'rotates', 'rotation', 'rotor', 'routine', 'row', 'roy', 'rp', 'rpa', 'rub', 'rubber', 'rubbing', 'rugged', 'rung', 'rupture', 'ruptured', 'rushed', 's', 'sa', 'sacrifice', 'sacrificial', 'saddle', 'safe', 'safety', 'said', 'sailor', 'sample', 'sampler', 'sampling', 'samuel', 'sand', 'sanding', 'sanitation', 'santa', 'santos', 'sardinel', 'saturated', 'saw', 'saying', 'scaffold', 'scaffolding', 'scaler', 'scaller', 'scalp', 'scare', 'sccop', 'scheduled', 'scissor', 'scoop', 'scooptram', 'scoria', 'scorpion', 'scrap', 'scraper', 'screen', 'screw', 'screwdriver', 'scruber', 'seal', 'sealing', 'seam', 'seat', 'seatbelt', 'second', 'secondary', 'section', 'sectioned', 'secured', 'securing', 'security', 'sediment', 'sedimentation', 'see', 'seeing', 'seen', 'segment', 'semikneeling', 'sensation', 'sensor', 'september', 'serf', 'serious', 'serra', 'servant', 'service', 'servitecforaco', 'set', 'setting', 'settling', 'seven', 'several', 'sf', 'shaft', 'shake', 'shaking', 'shallow', 'shank', 'shape', 'shaped', 'share', 'sharply', 'shear', 'sheepskin', 'sheet', 'shell', 'shield', 'shift', 'shifted', 'shining', 'shipment', 
'shipper', 'shipping', 'shirt', 'shock', 'shockbearing', 'shocrete', 'shoe', 'shooting', 'short', 'shorten', 'shot', 'shotcrete', 'shotcreteados', 'shotcreterepentinamente', 'shoulder', 'shovel', 'shower', 'shown', 'shutter', 'shuttering', 'sickle', 'side', 'siemag', 'signal', 'signaling', 'silicate', 'silo', 'silva', 'silver', 'simba', 'simultaneously', 'since', 'sink', 'sip', 'sit', 'site', 'sits', 'sitting', 'situation', 'size', 'sketched', 'skid', 'skimmer', 'skin', 'skip', 'slab', 'slag', 'slaughter', 'sledgehammer', 'sleeper', 'sleeve', 'slid', 'slide', 'sliding', 'slight', 'slightly', 'slimming', 'sling', 'slip', 'slipped', 'slippery', 'slipping', 'slope', 'sloping', 'slow', 'sludge', 'small', 'snack', 'snake', 'socket', 'socorro', 'soda', 'sodium', 'soft', 'soil', 'soiling', 'soldering', 'sole', 'solid', 'solubilization', 'solution', 'soon', 'soquet', 'sought', 'sound', 'south', 'space', 'span', 'spare', 'spark', 'spatter', 'spatula', 'spear', 'speart', 'specific', 'specified', 'spent', 'spike', 'spill', 'spilled', 'spilling', 'spillway', 'spine', 'splash', 'splashed', 'splinter', 'split', 'spoiler', 'spool', 'spoon', 'sprain', 'spume', 'spun', 'square', 'squat', 'squatting', 'sr', 'srgio', 'ssomac', 'st', 'sta', 'stability', 'stabilizer', 'stabilizes', 'stacked', 'stacker', 'stacking', 'staff', 'stage', 'stair', 'staircase', 'stake', 'stand', 'standardization', 'standing', 'start', 'started', 'starter', 'starting', 'startup', 'state', 'station', 'stationed', 'steam', 'steel', 'steep', 'steering', 'stem', 'step', 'stepladder', 'stepped', 'stepping', 'still', 'stilson', 'sting', 'stinging', 'stir', 'stirrup', 'stitch', 'stone', 'stood', 'stool', 'stooped', 'stop', 'stope', 'stoppage', 'stopped', 'stopper', 'storage', 'store', 'storm', 'stp', 'straight', 'strained', 'strap', 'street', 'strength', 'stretch', 'stretched', 'stretcher', 'strike', 'striking', 'strip', 'stripping', 'stroke', 'strong', 'struck', 'structure', 'strut', 'stuck', 'stumble', 'stumbled', 
'stump', 'stun', 'stung', 'stylet', 'subjection', 'submerged', 'subsequent', 'subsequently', 'substation', 'success', 'suction', 'sudden', 'suddenly', 'suffered', 'suffering', 'suffers', 'suitably', 'sul', 'sulfate', 'sulfide', 'sulfur', 'sulfuric', 'sulphate', 'sulphide', 'sump', 'sunday', 'sunglass', 'superciliary', 'superficial', 'superficially', 'superior', 'supervise', 'supervising', 'supervision', 'supervisor', 'supervisory', 'supply', 'support', 'supported', 'supporting', 'surcharge', 'sure', 'surface', 'surprised', 'surrounding', 'survey', 'surveying', 'suspended', 'suspender', 'sustained', 'sustaining', 'suture', 'sutured', 'swarm', 'swarming', 'sweep', 'swelling', 'swing', 'switched', 'symptom', 'system', 'table', 'tabola', 'tabolas', 'tail', 'tailing', 'tajo', 'take', 'taken', 'taking', 'talus', 'tangled', 'tank', 'tanker', 'tape', 'tapped', 'taque', 'target', 'task', 'taut', 'tc', 'teacher', 'team', 'teammate', 'tearing', 'technical', 'technician', 'tecl', 'tecla', 'tecle', 'tecnomin', 'telescopic', 'tell', 'tellomoinsac', 'temporarily', 'temporary', 'tension', 'tenth', 'test', 'testimony', 'tether', 'th', 'thermal', 'thermomagnetic', 'thickener', 'thickness', 'thigh', 'thinner', 'third', 'thorax', 'thorn', 'thread', 'three', 'threeway', 'threw', 'throw', 'throwing', 'thrown', 'thrust', 'thug', 'thumb', 'thunderous', 'thus', 'tick', 'tie', 'tied', 'tightened', 'tightening', 'tightens', 'tilt', 'tilted', 'time', 'timely', 'tip', 'tipper', 'tire', 'tirfor', 'tirford', 'tito', 'tj', 'tk', 'tm', 'tn', 'toe', 'toecap', 'together', 'toilet', 'told', 'ton', 'took', 'tool', 'top', 'topographic', 'torch', 'torque', 'torres', 'total', 'touch', 'touched', 'tour', 'toward', 'towards', 'tower', 'toxicity', 'toy', 'tq', 'tqs', 'track', 'tractor', 'trailer', 'trainee', 'tranfer', 'tranquera', 'transfe', 'transfer', 'transferred', 'transformer', 'transit', 'transiting', 'transmission', 'transport', 'transported', 'transporting', 'transverse', 'transversely', 'trap', 
'trapped', 'trapping', 'trauma', 'traumatic', 'traumatism', 'traveled', 'traveling', 'traversed', 'tray', 'tread', 'treading', 'treated', 'treatment', 'tree', 'trellex', 'trench', 'trestle', 'triangular', 'tried', 'trip', 'truck', 'try', 'trying', 'tube', 'tubing', 'tubo', 'tucum', 'tunel', 'tunnel', 'turn', 'turned', 'turning', 'turntable', 'twice', 'twist', 'twisted', 'twisting', 'two', 'tying', 'type', 'tyrfor', 'unbalanced', 'unbalancing', 'unclog', 'uncoupled', 'uncover', 'underground', 'underwent', 'uneven', 'unevenness', 'unexpectedly', 'unhooking', 'unicon', 'uniform', 'union', 'unit', 'unleashing', 'unload', 'unloaded', 'unloading', 'unlock', 'unlocking', 'unscrew', 'unstable', 'untie', 'untied', 'untimely', 'upon', 'upper', 'upward', 'upwards', 'us', 'use', 'used', 'using', 'ustulacin', 'ustulado', 'ustulador', 'ustulation', 'usual', 'utensil', 'v', 'vacuum', 'valve', 'van', 'vanishes', 'vazante', 'vegetation', 'vehicle', 'ventilation', 'verification', 'verified', 'verifies', 'verify', 'verifying', 'vertical', 'vertically', 'via', 'vial', 'victalica', 'victim', 'victor', 'vieira', 'vine', 'violent', 'violently', 'virdro', 'visibility', 'vision', 'visit', 'visited', 'vista', 'visual', 'visualizes', 'vitaulic', 'vms', 'void', 'voltage', 'volumetric', 'volvo', 'vsd', 'waelz', 'wagon', 'waiting', 'walk', 'walked', 'walking', 'wall', 'walrus', 'walter', 'wanted', 'wanting', 'warehouse', 'warley', 'warman', 'warning', 'warp', 'warrin', 'wash', 'washed', 'washing', 'wasp', 'waste', 'watch', 'water', 'watered', 'watermelon', 'waterthinner', 'waxed', 'way', 'wca', 'weakly', 'wear', 'wearing', 'wedge', 'weed', 'weevil', 'weighing', 'weighs', 'weight', 'weld', 'welder', 'welding', 'well', 'wellfield', 'went', 'west', 'wet', 'wheel', 'wheelbarrow', 'whiplash', 'whistling', 'wick', 'wide', 'width', 'wila', 'wilber', 'wilder', 'william', 'willing', 'wilmer', 'winch', 'winche', 'window', 'winemaker', 'winery', 'wire', 'withdrawal', 'withdrawing', 'withdrew', 'within', 
'without', 'wk', 'woman', 'wood', 'wooden', 'wore', 'work', 'worked', 'worker', 'workermechanic', 'working', 'workplace', 'workshop', 'worn', 'would', 'wound', 'wounding', 'wrench', 'wrist', 'x', 'xcm', 'xix', 'xray', 'xrd', 'xx', 'xxcm', 'xxx', 'yaranga', 'yard', 'ydrs', 'yield', 'yolk', 'young', 'z', 'zaf', 'zamac', 'zero', 'zinc', 'zinco', 'zn', 'zone']
# Group and print column names by dtype
for dtype in ISH_NLP_Word2Vec_df.dtypes.unique():
    print(f"Columns of type {dtype}:")
    print(ISH_NLP_Word2Vec_df.select_dtypes(include=[dtype]).columns.tolist())
    print()
Columns of type object:
['Country', 'City', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']

Columns of type int64:
['DayOfWeek', 'Year', 'Month', 'Day']

Columns of type float32:
['Word2Vec_0', 'Word2Vec_1', 'Word2Vec_2', 'Word2Vec_3', 'Word2Vec_4', 'Word2Vec_5', 'Word2Vec_6', 'Word2Vec_7', 'Word2Vec_8', 'Word2Vec_9', 'Word2Vec_10', 'Word2Vec_11', 'Word2Vec_12', 'Word2Vec_13', 'Word2Vec_14', 'Word2Vec_15', 'Word2Vec_16', 'Word2Vec_17', 'Word2Vec_18', 'Word2Vec_19', 'Word2Vec_20', 'Word2Vec_21', 'Word2Vec_22', 'Word2Vec_23', 'Word2Vec_24', 'Word2Vec_25', 'Word2Vec_26', 'Word2Vec_27', 'Word2Vec_28', 'Word2Vec_29', 'Word2Vec_30', 'Word2Vec_31', 'Word2Vec_32', 'Word2Vec_33', 'Word2Vec_34', 'Word2Vec_35', 'Word2Vec_36', 'Word2Vec_37', 'Word2Vec_38', 'Word2Vec_39', 'Word2Vec_40', 'Word2Vec_41', 'Word2Vec_42', 'Word2Vec_43', 'Word2Vec_44', 'Word2Vec_45', 'Word2Vec_46', 'Word2Vec_47', 'Word2Vec_48', 'Word2Vec_49', 'Word2Vec_50', 'Word2Vec_51', 'Word2Vec_52', 'Word2Vec_53', 'Word2Vec_54', 'Word2Vec_55', 'Word2Vec_56', 'Word2Vec_57', 'Word2Vec_58', 'Word2Vec_59', 'Word2Vec_60', 'Word2Vec_61', 'Word2Vec_62', 'Word2Vec_63', 'Word2Vec_64', 'Word2Vec_65', 'Word2Vec_66', 'Word2Vec_67', 'Word2Vec_68', 'Word2Vec_69', 'Word2Vec_70', 'Word2Vec_71', 'Word2Vec_72', 'Word2Vec_73', 'Word2Vec_74', 'Word2Vec_75', 'Word2Vec_76', 'Word2Vec_77', 'Word2Vec_78', 'Word2Vec_79', 'Word2Vec_80', 'Word2Vec_81', 'Word2Vec_82', 'Word2Vec_83', 'Word2Vec_84', 'Word2Vec_85', 'Word2Vec_86', 'Word2Vec_87', 'Word2Vec_88', 'Word2Vec_89', 'Word2Vec_90', 'Word2Vec_91', 'Word2Vec_92', 'Word2Vec_93', 'Word2Vec_94', 'Word2Vec_95', 'Word2Vec_96', 'Word2Vec_97', 'Word2Vec_98', 'Word2Vec_99', 'Word2Vec_100', 'Word2Vec_101', 'Word2Vec_102', 'Word2Vec_103', 'Word2Vec_104', 'Word2Vec_105', 'Word2Vec_106', 'Word2Vec_107', 'Word2Vec_108', 'Word2Vec_109', 'Word2Vec_110', 'Word2Vec_111', 'Word2Vec_112', 'Word2Vec_113', 'Word2Vec_114', 
'Word2Vec_115', 'Word2Vec_116', 'Word2Vec_117', 'Word2Vec_118', 'Word2Vec_119', 'Word2Vec_120', 'Word2Vec_121', 'Word2Vec_122', 'Word2Vec_123', 'Word2Vec_124', 'Word2Vec_125', 'Word2Vec_126', 'Word2Vec_127', 'Word2Vec_128', 'Word2Vec_129', 'Word2Vec_130', 'Word2Vec_131', 'Word2Vec_132', 'Word2Vec_133', 'Word2Vec_134', 'Word2Vec_135', 'Word2Vec_136', 'Word2Vec_137', 'Word2Vec_138', 'Word2Vec_139', 'Word2Vec_140', 'Word2Vec_141', 'Word2Vec_142', 'Word2Vec_143', 'Word2Vec_144', 'Word2Vec_145', 'Word2Vec_146', 'Word2Vec_147', 'Word2Vec_148', 'Word2Vec_149', 'Word2Vec_150', 'Word2Vec_151', 'Word2Vec_152', 'Word2Vec_153', 'Word2Vec_154', 'Word2Vec_155', 'Word2Vec_156', 'Word2Vec_157', 'Word2Vec_158', 'Word2Vec_159', 'Word2Vec_160', 'Word2Vec_161', 'Word2Vec_162', 'Word2Vec_163', 'Word2Vec_164', 'Word2Vec_165', 'Word2Vec_166', 'Word2Vec_167', 'Word2Vec_168', 'Word2Vec_169', 'Word2Vec_170', 'Word2Vec_171', 'Word2Vec_172', 'Word2Vec_173', 'Word2Vec_174', 'Word2Vec_175', 'Word2Vec_176', 'Word2Vec_177', 'Word2Vec_178', 'Word2Vec_179', 'Word2Vec_180', 'Word2Vec_181', 'Word2Vec_182', 'Word2Vec_183', 'Word2Vec_184', 'Word2Vec_185', 'Word2Vec_186', 'Word2Vec_187', 'Word2Vec_188', 'Word2Vec_189', 'Word2Vec_190', 'Word2Vec_191', 'Word2Vec_192', 'Word2Vec_193', 'Word2Vec_194', 'Word2Vec_195', 'Word2Vec_196', 'Word2Vec_197', 'Word2Vec_198', 'Word2Vec_199', 'Word2Vec_200', 'Word2Vec_201', 'Word2Vec_202', 'Word2Vec_203', 'Word2Vec_204', 'Word2Vec_205', 'Word2Vec_206', 'Word2Vec_207', 'Word2Vec_208', 'Word2Vec_209', 'Word2Vec_210', 'Word2Vec_211', 'Word2Vec_212', 'Word2Vec_213', 'Word2Vec_214', 'Word2Vec_215', 'Word2Vec_216', 'Word2Vec_217', 'Word2Vec_218', 'Word2Vec_219', 'Word2Vec_220', 'Word2Vec_221', 'Word2Vec_222', 'Word2Vec_223', 'Word2Vec_224', 'Word2Vec_225', 'Word2Vec_226', 'Word2Vec_227', 'Word2Vec_228', 'Word2Vec_229', 'Word2Vec_230', 'Word2Vec_231', 'Word2Vec_232', 'Word2Vec_233', 'Word2Vec_234', 'Word2Vec_235', 'Word2Vec_236', 'Word2Vec_237', 'Word2Vec_238', 'Word2Vec_239', 
'Word2Vec_240', 'Word2Vec_241', 'Word2Vec_242', 'Word2Vec_243', 'Word2Vec_244', 'Word2Vec_245', 'Word2Vec_246', 'Word2Vec_247', 'Word2Vec_248', 'Word2Vec_249', 'Word2Vec_250', 'Word2Vec_251', 'Word2Vec_252', 'Word2Vec_253', 'Word2Vec_254', 'Word2Vec_255', 'Word2Vec_256', 'Word2Vec_257', 'Word2Vec_258', 'Word2Vec_259', 'Word2Vec_260', 'Word2Vec_261', 'Word2Vec_262', 'Word2Vec_263', 'Word2Vec_264', 'Word2Vec_265', 'Word2Vec_266', 'Word2Vec_267', 'Word2Vec_268', 'Word2Vec_269', 'Word2Vec_270', 'Word2Vec_271', 'Word2Vec_272', 'Word2Vec_273', 'Word2Vec_274', 'Word2Vec_275', 'Word2Vec_276', 'Word2Vec_277', 'Word2Vec_278', 'Word2Vec_279', 'Word2Vec_280', 'Word2Vec_281', 'Word2Vec_282', 'Word2Vec_283', 'Word2Vec_284', 'Word2Vec_285', 'Word2Vec_286', 'Word2Vec_287', 'Word2Vec_288', 'Word2Vec_289', 'Word2Vec_290', 'Word2Vec_291', 'Word2Vec_292', 'Word2Vec_293', 'Word2Vec_294', 'Word2Vec_295', 'Word2Vec_296', 'Word2Vec_297', 'Word2Vec_298', 'Word2Vec_299']
Label encode 'Accident Level' and 'Potential Accident Level' in all three dataframes
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Encode 'Accident Level' and 'Potential Accident Level' in ISH_NLP_Glove_df
ISH_NLP_Glove_df['Accident Level'] = label_encoder.fit_transform(ISH_NLP_Glove_df['Accident Level'])
ISH_NLP_Glove_df['Potential Accident Level'] = label_encoder.fit_transform(ISH_NLP_Glove_df['Potential Accident Level'])
# Encode 'Accident Level' and 'Potential Accident Level' in ISH_NLP_TFIDF_df
ISH_NLP_TFIDF_df['Accident Level'] = label_encoder.fit_transform(ISH_NLP_TFIDF_df['Accident Level'])
ISH_NLP_TFIDF_df['Potential Accident Level'] = label_encoder.fit_transform(ISH_NLP_TFIDF_df['Potential Accident Level'])
# Encode 'Accident Level' and 'Potential Accident Level' in ISH_NLP_Word2Vec_df
ISH_NLP_Word2Vec_df['Accident Level'] = label_encoder.fit_transform(ISH_NLP_Word2Vec_df['Accident Level'])
ISH_NLP_Word2Vec_df['Potential Accident Level'] = label_encoder.fit_transform(ISH_NLP_Word2Vec_df['Potential Accident Level'])
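For reference, `LabelEncoder` assigns integer codes in lexicographic order of the class labels, which for the Roman-numeral levels used here happens to match severity order (I → 0 through V → 4). A minimal sketch with a hypothetical sample of levels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of severity levels, mirroring this dataset's labels
levels = ['I', 'IV', 'II', 'V', 'III', 'I']

le = LabelEncoder()
encoded = le.fit_transform(levels)

# Classes are sorted lexicographically before codes are assigned
print(dict(zip(le.classes_, range(len(le.classes_)))))
# {'I': 0, 'II': 1, 'III': 2, 'IV': 3, 'V': 4}
print(encoded.tolist())  # [0, 3, 1, 4, 2, 0]
```

Because each `fit_transform` call refits the encoder, reusing one `label_encoder` instance across columns and dataframes, as done above, is safe.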
# Columns to drop
columns_to_drop = ['Year', 'Month', 'Day', 'Potential Accident Level', 'Description']
# Drop columns from each DataFrame
ISH_NLP_Glove_df = ISH_NLP_Glove_df.drop(columns_to_drop, axis=1)
ISH_NLP_TFIDF_df = ISH_NLP_TFIDF_df.drop(columns_to_drop, axis=1)
ISH_NLP_Word2Vec_df = ISH_NLP_Word2Vec_df.drop(columns_to_drop, axis=1)
# Calculate target variable distribution for each DataFrame
glove_target_dist = ISH_NLP_Glove_df['Accident Level'].value_counts(normalize=False)
tfidf_target_dist = ISH_NLP_TFIDF_df['Accident Level'].value_counts(normalize=False)
word2vec_target_dist = ISH_NLP_Word2Vec_df['Accident Level'].value_counts(normalize=False)
# Create a DataFrame to display the distributions
target_distribution_df = pd.DataFrame({
'Glove': glove_target_dist,
'TF-IDF': tfidf_target_dist,
'Word2Vec': word2vec_target_dist
})
# Print the DataFrame
target_distribution_df
| Accident Level | Glove | TF-IDF | Word2Vec |
|---|---|---|---|
| 0 | 309 | 309 | 309 |
| 1 | 40 | 40 | 40 |
| 2 | 31 | 31 | 31 |
| 3 | 30 | 30 | 30 |
| 4 | 8 | 8 | 8 |
Target Variable Distribution: the target is severely imbalanced. Accident Level 0 (level I) accounts for 309 of the 418 records (~74%), while the most severe class (4, level V) has only 8 records. The distribution is identical across the GloVe, TF-IDF, and Word2Vec dataframes because they share the same target column.

Implications for Modeling: a classifier trained on the raw distribution would be biased toward the majority class; it could reach ~74% accuracy by always predicting level I while never flagging the rare, most severe incidents. We therefore balance the classes with SMOTE below, and evaluation should rely on class-aware metrics such as macro F1 or per-class recall rather than plain accuracy.
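To quantify the imbalance, the class counts in the table above can be turned into proportions and a majority-to-minority ratio; a quick standalone sketch:

```python
# Class counts taken from the distribution table above (codes 0..4 = levels I..V)
counts = {0: 309, 1: 40, 2: 31, 3: 30, 4: 8}

total = sum(counts.values())  # 418 records
proportions = {k: round(v / total, 3) for k, v in counts.items()}
imbalance_ratio = counts[0] / counts[4]  # majority count / minority count

print(proportions)      # {0: 0.739, 1: 0.096, 2: 0.074, 3: 0.072, 4: 0.019}
print(imbalance_ratio)  # 38.625
```

An imbalance ratio near 39:1 is well into the range where resampling or class weighting is generally advised.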
!pip install imblearn
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.3)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.3.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
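The balancing step below one-hot encodes categorical features with `pd.get_dummies(..., drop_first=True)`, which drops each column's first category to avoid redundant (perfectly collinear) dummy columns. A minimal sketch on a toy column (the `Country` values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with one categorical column, mirroring the encoding step below
toy = pd.DataFrame({'Country': ['Country_01', 'Country_02', 'Country_03', 'Country_01']})

dummies = pd.get_dummies(toy, columns=['Country'], dtype=int, drop_first=True)
print(dummies.columns.tolist())
# ['Country_Country_02', 'Country_Country_03']  (first category dropped)
print(dummies.values.tolist())
# [[0, 0], [1, 0], [0, 1], [0, 0]]
```

A row of all zeros therefore encodes the dropped first category (`Country_01`).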
# Balance 'Accident Level' using SMOTE for all three dataframes,
# converting categorical features to numerical via one-hot encoding
import pandas as pd
from imblearn.over_sampling import SMOTE
# Function to balance data and one-hot encode categorical features
def balance_and_encode(df):
    # Separate features and target variable
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    # One-hot encode categorical features (if any)
    categorical_features = X.select_dtypes(include=['object']).columns
    if categorical_features.any():
        X_encoded = pd.get_dummies(X, columns=categorical_features, dtype=int, drop_first=True)
    else:
        X_encoded = X
    # One-hot encode 'DayOfWeek' separately: it is stored as an integer,
    # so the object-dtype pass above does not catch it
    X_encoded = pd.get_dummies(X_encoded, columns=['DayOfWeek'], dtype=int, drop_first=True)
    # Apply SMOTE to balance the dataset (note: oversampling before the
    # train/test split means synthetic samples can end up in the test set)
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_encoded, y)
    # Combine balanced features and target
    balanced_df = pd.concat([X_resampled, y_resampled], axis=1)
    return balanced_df
# Apply the function to each DataFrame
ISH_NLP_Glove_df_Bal = balance_and_encode(ISH_NLP_Glove_df)
ISH_NLP_TFIDF_df_Bal = balance_and_encode(ISH_NLP_TFIDF_df)
ISH_NLP_Word2Vec_df_Bal = balance_and_encode(ISH_NLP_Word2Vec_df)
# Calculate balanced target variable distribution for each DataFrame
glove_balanced_dist = ISH_NLP_Glove_df_Bal['Accident Level'].value_counts(normalize=False)
tfidf_balanced_dist = ISH_NLP_TFIDF_df_Bal['Accident Level'].value_counts(normalize=False)
word2vec_balanced_dist = ISH_NLP_Word2Vec_df_Bal['Accident Level'].value_counts(normalize=False)
# Create a DataFrame to display the balanced distributions
Balanced_Distribution_df = pd.DataFrame({
'Glove (Balanced)': glove_balanced_dist,
'TF-IDF (Balanced)': tfidf_balanced_dist,
'Word2Vec (Balanced)': word2vec_balanced_dist
})
# Print the DataFrame
Balanced_Distribution_df
| Accident Level | Glove (Balanced) | TF-IDF (Balanced) | Word2Vec (Balanced) |
|---|---|---|---|
| 0 | 309 | 309 | 309 |
| 3 | 309 | 309 | 309 |
| 2 | 309 | 309 | 309 |
| 1 | 309 | 309 | 309 |
| 4 | 309 | 309 | 309 |
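The pipeline above oversamples the full dataset before splitting; a leakage-free variant oversamples only the training partition so that no synthetic rows reach the test set. A minimal sketch with plain random duplication standing in for SMOTE (so it runs without imblearn); the toy data and column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Illustrative imbalanced toy data: 90 majority / 10 minority samples
toy = pd.DataFrame({'f1': rng.normal(size=100),
                    'label': [0] * 90 + [1] * 10})

# Split FIRST, keeping class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    toy[['f1']], toy['label'], test_size=0.2, random_state=42, stratify=toy['label'])

# Then oversample minority classes in the train split only
train = pd.concat([X_train, y_train], axis=1)
target = train['label'].value_counts().max()
pieces = [grp.sample(n=target, replace=True, random_state=42)
          for _, grp in train.groupby('label')]
train_bal = pd.concat(pieces).reset_index(drop=True)

print(train_bal['label'].value_counts().to_dict())  # both classes at the majority count
```

With SMOTE the only change would be replacing the duplication step with `smote.fit_resample(X_train, y_train)`; the test set stays untouched either way.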
ISH_NLP_Glove_df_Bal
| GloVe_0 | GloVe_1 | GloVe_2 | GloVe_3 | GloVe_4 | GloVe_5 | GloVe_6 | GloVe_7 | GloVe_8 | GloVe_9 | ... | Critical Risk_Vehicles and Mobile Equipment | Critical Risk_Venomous Animals | Critical Risk_remains of choco | DayOfWeek_1 | DayOfWeek_2 | DayOfWeek_3 | DayOfWeek_4 | DayOfWeek_5 | DayOfWeek_6 | Accident Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.057628 | 0.065342 | -0.019501 | -0.264583 | -0.140774 | -0.060398 | 0.111248 | -0.036066 | 0.015840 | -0.905868 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | -0.068634 | 0.120895 | -0.046153 | -0.168422 | 0.020937 | -0.106742 | 0.030717 | -0.097282 | 0.066715 | -0.921388 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | -0.038172 | 0.206443 | -0.202828 | -0.156088 | -0.007283 | -0.034272 | -0.191986 | -0.048705 | 0.003676 | -0.814817 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | -0.017094 | 0.038141 | 0.013703 | -0.171292 | -0.056809 | -0.101380 | -0.077591 | 0.000560 | -0.030361 | -0.761708 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | -0.099225 | 0.095072 | -0.123143 | -0.069148 | -0.095534 | -0.048877 | 0.106987 | 0.047991 | 0.026990 | -0.772863 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1540 | -0.006937 | 0.090132 | -0.034344 | -0.150909 | -0.163116 | -0.082174 | 0.009945 | 0.011470 | 0.030997 | -0.899795 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1541 | -0.012192 | 0.077488 | -0.016162 | -0.124923 | -0.112004 | -0.074769 | 0.085889 | -0.023623 | 0.019261 | -0.951245 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1542 | -0.038209 | 0.004577 | 0.033985 | -0.147681 | -0.042825 | 0.000638 | -0.011167 | -0.056092 | -0.022267 | -0.890274 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1543 | -0.107909 | 0.058875 | -0.044034 | -0.162590 | -0.082601 | -0.041802 | 0.060897 | 0.049642 | 0.067454 | -0.808875 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 4 |
| 1544 | -0.026800 | 0.055851 | -0.010890 | -0.175891 | -0.145525 | 0.004113 | 0.026513 | 0.008078 | 0.006551 | -1.020623 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
1545 rows × 357 columns
ISH_NLP_TFIDF_df_Bal
| abb | abdomen | able | abratech | abrupt | abruptly | absorbent | absorbing | abutment | acc | ... | Critical Risk_Vehicles and Mobile Equipment | Critical Risk_Venomous Animals | Critical Risk_remains of choco | DayOfWeek_1 | DayOfWeek_2 | DayOfWeek_3 | DayOfWeek_4 | DayOfWeek_5 | DayOfWeek_6 | Accident Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1540 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1541 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1543 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 4 |
| 1544 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
1545 rows × 2871 columns
ISH_NLP_Word2Vec_df_Bal
| Word2Vec_0 | Word2Vec_1 | Word2Vec_2 | Word2Vec_3 | Word2Vec_4 | Word2Vec_5 | Word2Vec_6 | Word2Vec_7 | Word2Vec_8 | Word2Vec_9 | ... | Critical Risk_Vehicles and Mobile Equipment | Critical Risk_Venomous Animals | Critical Risk_remains of choco | DayOfWeek_1 | DayOfWeek_2 | DayOfWeek_3 | DayOfWeek_4 | DayOfWeek_5 | DayOfWeek_6 | Accident Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000083 | 0.009217 | 0.000322 | 0.003347 | 0.001474 | -0.008945 | 0.005879 | 0.016230 | 0.003878 | -0.002858 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0.000170 | 0.003141 | 0.001116 | 0.001657 | 0.000701 | -0.003123 | 0.001844 | 0.006504 | 0.000751 | -0.000125 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.000277 | 0.009596 | 0.000995 | 0.003688 | 0.000974 | -0.009674 | 0.005705 | 0.017936 | 0.003464 | -0.002458 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.000118 | 0.007303 | 0.000819 | 0.003142 | 0.000470 | -0.007395 | 0.004152 | 0.014255 | 0.003106 | -0.001568 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.000297 | 0.007635 | 0.000646 | 0.002531 | 0.001687 | -0.006660 | 0.004018 | 0.013523 | 0.002996 | -0.002194 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1540 | 0.000552 | 0.005437 | 0.000652 | 0.002136 | 0.001072 | -0.005427 | 0.002989 | 0.009372 | 0.002120 | -0.001324 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1541 | 0.000323 | 0.007228 | 0.000444 | 0.003578 | 0.001426 | -0.007101 | 0.003694 | 0.012829 | 0.002882 | -0.001772 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1542 | -0.000117 | 0.004237 | 0.000335 | 0.001822 | 0.001081 | -0.004363 | 0.002500 | 0.007899 | 0.001784 | -0.001523 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1543 | 0.000525 | 0.007563 | 0.000523 | 0.003714 | 0.001145 | -0.007065 | 0.003433 | 0.012680 | 0.002816 | -0.001937 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 4 |
| 1544 | 0.000899 | 0.006915 | 0.000682 | 0.002737 | 0.001042 | -0.006204 | 0.003744 | 0.011427 | 0.002492 | -0.001622 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
1545 rows × 357 columns
# Check for missing values and duplicates in all three dataframes
def check_data_quality(df, df_name):
    missing_values = df.isnull().sum()
    duplicates = df.duplicated().sum()
    return pd.DataFrame({
        'DataFrame': [df_name],
        'Missing Values': [missing_values.sum()],
        'Duplicates': [duplicates]
    })
# Check data quality for each DataFrame
glove_quality = check_data_quality(ISH_NLP_Glove_df_Bal, 'ISH_NLP_Glove_df_Bal')
tfidf_quality = check_data_quality(ISH_NLP_TFIDF_df_Bal, 'ISH_NLP_TFIDF_df_Bal')
word2vec_quality = check_data_quality(ISH_NLP_Word2Vec_df_Bal, 'ISH_NLP_Word2Vec_df_Bal')
# Concatenate results into a single DataFrame
data_quality_summary = pd.concat([glove_quality, tfidf_quality, word2vec_quality], ignore_index=True)
# Display the summary
data_quality_summary
| DataFrame | Missing Values | Duplicates | |
|---|---|---|---|
| 0 | ISH_NLP_Glove_df_Bal | 0 | 0 |
| 1 | ISH_NLP_TFIDF_df_Bal | 0 | 0 |
| 2 | ISH_NLP_Word2Vec_df_Bal | 0 | 0 |
# Rename the final dataframes as Final_NLP_Glove_df, Final_NLP_TFIDF_df & Final_NLP_Word2Vec_df
Final_NLP_Glove_df = ISH_NLP_Glove_df_Bal.copy()
Final_NLP_TFIDF_df = ISH_NLP_TFIDF_df_Bal.copy()
Final_NLP_Word2Vec_df = ISH_NLP_Word2Vec_df_Bal.copy()
!pip install openpyxl
Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (3.1.5) Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl) (1.1.0)
# Export the 3 dataframes to CSV and XLSX (the Drive mount point is /content/gdrive)
# Export to CSV
Final_NLP_Glove_df.to_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_Glove_df.csv', index=False)
Final_NLP_TFIDF_df.to_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_TFIDF_df.csv', index=False)
Final_NLP_Word2Vec_df.to_csv('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_Word2Vec_df.csv', index=False)
# Export to Excel
Final_NLP_Glove_df.to_excel('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_Glove_df.xlsx', index=False)
Final_NLP_TFIDF_df.to_excel('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_TFIDF_df.xlsx', index=False)
Final_NLP_Word2Vec_df.to_excel('/content/gdrive/My Drive/Capstone_Group10_NLP1/Final_NLP_Word2Vec_df.xlsx', index=False)
# Initialise the candidate classifiers and run them on the 3 dataframes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import time
# Initialize classifiers
classifiers = {
"Logistic Regression": LogisticRegression(),
"Support Vector Machine": SVC(),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(),
"Gradient Boosting": GradientBoostingClassifier(),
"XG Boost": XGBClassifier(),
"Naive Bayes": GaussianNB(),
"K-Nearest Neighbors": KNeighborsClassifier()
}
# Function to train and evaluate models
def train_and_evaluate(df):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    results = []
    for name, clf in classifiers.items():
        start_time = time.time()
        clf.fit(X_train, y_train)
        training_time = time.time() - start_time
        # Train metrics
        y_train_pred = clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        start_time = time.time()
        y_test_pred = clf.predict(X_test)
        prediction_time = time.time() - start_time
        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')
        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time])
    return results
# Train and evaluate on each DataFrame
glove_results = train_and_evaluate(Final_NLP_Glove_df)
tfidf_results = train_and_evaluate(Final_NLP_TFIDF_df)
word2vec_results = train_and_evaluate(Final_NLP_Word2Vec_df)
# Create DataFrames for results
columns = ['Classifier',
'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
'Training Time', 'Prediction Time']
glove_df = pd.DataFrame(glove_results, columns=columns)
tfidf_df = pd.DataFrame(tfidf_results, columns=columns)
word2vec_df = pd.DataFrame(word2vec_results, columns=columns)
print("Classification metrics for GloVe")
glove_df
Classification metrics for GloVe
| Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.963592 | 0.963436 | 0.963592 | 0.963494 | 0.928803 | 0.933631 | 0.928803 | 0.929641 | 0.126808 | 0.005065 |
| 1 | Support Vector Machine | 0.962783 | 0.963127 | 0.962783 | 0.962850 | 0.912621 | 0.925272 | 0.912621 | 0.914822 | 0.208429 | 0.093478 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.883495 | 0.880778 | 0.883495 | 0.878947 | 0.440358 | 0.003123 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.990291 | 0.990464 | 0.990291 | 0.990265 | 1.671161 | 0.013798 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.971015 | 0.970874 | 0.970540 | 74.470469 | 0.007030 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.974110 | 0.974697 | 0.974110 | 0.973937 | 2.941567 | 0.069770 |
| 6 | Naive Bayes | 0.576052 | 0.686802 | 0.576052 | 0.555990 | 0.576052 | 0.619135 | 0.576052 | 0.560298 | 0.009056 | 0.005299 |
| 7 | K-Nearest Neighbors | 0.850324 | 0.875346 | 0.850324 | 0.825762 | 0.838188 | 0.862608 | 0.838188 | 0.798293 | 0.004603 | 0.019504 |
print("Classification metrics for TF-IDF")
tfidf_df
Classification metrics for TF-IDF
| Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.983819 | 0.983927 | 0.983819 | 0.983854 | 0.948220 | 0.954307 | 0.948220 | 0.949603 | 15.831049 | 0.048633 |
| 1 | Support Vector Machine | 0.979773 | 0.980294 | 0.979773 | 0.979849 | 0.925566 | 0.943783 | 0.925566 | 0.929372 | 1.348215 | 0.619280 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.860841 | 0.863785 | 0.860841 | 0.861183 | 0.224136 | 0.016699 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.977346 | 0.979500 | 0.977346 | 0.977712 | 0.622928 | 0.028605 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.919094 | 0.929961 | 0.919094 | 0.922106 | 28.802572 | 0.021820 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.944984 | 0.955475 | 0.944984 | 0.946983 | 5.141534 | 0.517637 |
| 6 | Naive Bayes | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.973284 | 0.970874 | 0.971391 | 0.067816 | 0.033026 |
| 7 | K-Nearest Neighbors | 0.859223 | 0.881303 | 0.859223 | 0.841769 | 0.844660 | 0.845034 | 0.844660 | 0.821537 | 0.034786 | 0.045716 |
print("Classification metrics for Word2Vec")
word2vec_df
Classification metrics for Word2Vec
| Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.679612 | 0.678206 | 0.679612 | 0.675876 | 0.644013 | 0.639175 | 0.644013 | 0.632709 | 0.162962 | 0.005017 |
| 1 | Support Vector Machine | 0.757282 | 0.760561 | 0.757282 | 0.752859 | 0.692557 | 0.705498 | 0.692557 | 0.683242 | 0.207278 | 0.095707 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.815534 | 0.804061 | 0.815534 | 0.805937 | 0.494382 | 0.003046 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.961165 | 0.961158 | 0.961165 | 0.961090 | 1.732327 | 0.014183 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.961165 | 0.961879 | 0.961165 | 0.959731 | 72.054121 | 0.007030 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.964401 | 0.964198 | 0.964401 | 0.963557 | 4.281512 | 0.072283 |
| 6 | Naive Bayes | 0.529935 | 0.593576 | 0.529935 | 0.513007 | 0.537217 | 0.579248 | 0.537217 | 0.527450 | 0.008865 | 0.005488 |
| 7 | K-Nearest Neighbors | 0.839806 | 0.850000 | 0.839806 | 0.829815 | 0.770227 | 0.759622 | 0.770227 | 0.757096 | 0.004843 | 0.019435 |
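The precision/recall/F1 columns above use average='weighted', which weights each class by its support; average='macro' treats all classes equally and is harsher on minority-class mistakes. A small illustrative comparison on made-up labels:

```python
from sklearn.metrics import f1_score

# Toy example: the classifier predicts the majority class for everything
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

# Class 0: F1 = 6/7; class 1: F1 = 0
print(f"weighted F1: {weighted:.3f}")  # 0.75 * 6/7 = 0.643
print(f"macro F1:    {macro:.3f}")     # 0.5 * 6/7  = 0.429
```

After SMOTE the classes are balanced, so weighted and macro averages coincide here; the distinction matters when reporting metrics on the original, imbalanced data.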
GloVe Embeddings: Random Forest leads with about 0.99 test accuracy, followed by XGBoost (0.974) and Gradient Boosting (0.971); Naive Bayes performs poorly (0.576).
TF-IDF Features: the sparse term features suit Logistic Regression (0.948) and especially Naive Bayes (0.971), though Logistic Regression is by far the slowest linear model to train here (about 16 s).
Word2Vec Embeddings: the averaged vectors hurt the linear models (Logistic Regression drops to 0.644), while the tree ensembles still reach roughly 0.96.
Insights: tree-based ensembles are the most robust choice across all three representations. Their near-perfect training scores (0.999) signal overfitting, and the very high test scores are partly optimistic because SMOTE was applied before the train/test split.
# Plotting the classification report for all the ML classifiers, with training and prediction time comparisons
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Function to plot classification report and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                      'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Oranges', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')
    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()
# Plot results for each DataFrame
plot_results(glove_df, 'Glove Embeddings')
plot_results(tfidf_df, 'TF-IDF Embeddings')
plot_results(word2vec_df, 'Word2Vec Embeddings')
# Function to plot confusion matrix against all classifiers with word embeddings generated using Glove, TF-IDF, Word2Vec:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
def plot_confusion_matrices(df, df_name):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    fig.suptitle(f'Confusion Matrices for {df_name}', fontsize=16)
    for i, (name, clf) in enumerate(classifiers.items()):
        row = i // 4
        col = i % 4
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        cm = confusion_matrix(y_test, y_pred)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
        disp.plot(ax=axes[row, col], cmap='Oranges')
        axes[row, col].set_title(name)
    plt.tight_layout()
    plt.show()
plot_confusion_matrices(Final_NLP_Glove_df, 'Glove Embeddings')
plot_confusion_matrices(Final_NLP_TFIDF_df, 'TF-IDF Features')
plot_confusion_matrices(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
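Reading the panels: each row of a confusion matrix is a true class, so per-class recall is the diagonal entry divided by the row sum, and per-class precision uses the column sum. A small sketch with a hypothetical 3-class matrix:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

per_class_recall = np.diag(cm) / cm.sum(axis=1)     # e.g. class 0: 50 / 55
per_class_precision = np.diag(cm) / cm.sum(axis=0)  # column sums instead

print(per_class_recall.round(3))
print(per_class_precision.round(3))
```

These per-class numbers are what to watch for the rare 'Accident Level' classes, since the aggregate metrics above can hide weak minority-class recall.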
Overall Performance:
Glove Embeddings:
TF-IDF Features:
Word2Vec Embeddings:
Class-specific observations:
Model Complexity:
Embedding Effectiveness:
Conclusion:
Train vs Test Confusion Matrices for all Base ML classifiers
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
def plot_train_test_confusion_matrices(df, df_name):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name}', fontsize=15, y=0.98)
    for i, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)
        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Oranges')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)
        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Oranges')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
plot_train_test_confusion_matrices(Final_NLP_Glove_df, 'Glove Embeddings')
plot_train_test_confusion_matrices(Final_NLP_TFIDF_df, 'TF-IDF Features')
plot_train_test_confusion_matrices(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
# Apply scaling and PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
def apply_pca_and_split(df, n_components=0.99):
    X = df.drop('Accident Level', axis=1)
    y = df['Accident Level']
    # Scaling (fitted on the full dataset for simplicity; fitting on the
    # train split only would avoid information leaking into the test set)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # PCA: a float < 1 keeps enough components to explain that fraction of the variance
    if n_components < 1:
        pca = PCA(n_components=n_components)
        X_pca = pca.fit_transform(X_scaled)
    else:
        X_pca = X_scaled
    # Splitting
    X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
# Apply to each dataframe
X_train_glove, X_test_glove, y_train_glove, y_test_glove = apply_pca_and_split(Final_NLP_Glove_df)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = apply_pca_and_split(Final_NLP_TFIDF_df)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = apply_pca_and_split(Final_NLP_Word2Vec_df)
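Passing a float as n_components makes PCA keep the smallest number of components whose cumulative explained variance reaches that target. A quick sketch on synthetic data (the shapes, seed and rank are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Rank-5 data embedded in 20 dimensions: PCA should compress it sharply
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 20))
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.99)   # keep components covering 99% of variance
X_reduced = pca.fit_transform(X)

print(pca.n_components_, X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # at least 0.99 by construction
```

Since the toy data has only 5 underlying directions, at most 5 of the 20 columns survive, which mirrors how the dense GloVe/Word2Vec features compress far better than the sparse TF-IDF matrix.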
# Function to print explained variance ratio and cumulative explained variance for all 3 embeddings
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def print_pca_variance(df, df_name):
    X = df.drop('Accident Level', axis=1)
    # Scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # PCA
    pca = PCA()
    pca.fit(X_scaled)
    # Explained variance ratio and cumulative explained variance
    explained_variance_ratio = pca.explained_variance_ratio_
    cumulative_explained_variance = np.cumsum(explained_variance_ratio)
    print(f"----- PCA Variance for {df_name} -----")
    print("Explained Variance Ratio:", explained_variance_ratio)
    print("Cumulative Explained Variance:", cumulative_explained_variance)
# Print PCA variance for each dataframe
print_pca_variance(Final_NLP_Glove_df, 'Glove Embeddings')
print_pca_variance(Final_NLP_TFIDF_df, 'TF-IDF Features')
print_pca_variance(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
----- PCA Variance for Glove Embeddings -----
Explained Variance Ratio: [7.20317111e-02 4.65175696e-02 4.19325297e-02 ... 4.87323026e-07 1.42057502e-32 2.61955624e-34]
Cumulative Explained Variance: [0.07203171 0.11854928 0.16048181 ... 1. 1. 1.]
----- PCA Variance for TF-IDF Features -----
Explained Variance Ratio: [1.19298130e-02 9.61007571e-03 9.19375412e-03 ... 4.19480060e-37 1.49772201e-37 1.80147440e-38]
Cumulative Explained Variance: [0.01192981 0.02153989 0.03073364 ... 1. 1. 1.]
----- PCA Variance for Word2Vec Embeddings -----
Explained Variance Ratio: [5.10388407e-01 2.61003976e-02 1.54728890e-02 ... (output truncated)]
5.31110285e-05 5.26819992e-05 5.05019206e-05 4.97843171e-05 4.86769005e-05 4.71189501e-05 4.55860049e-05 4.52795115e-05 4.38056346e-05 4.13281003e-05 4.07578702e-05 3.92627677e-05 3.86047395e-05 3.69453583e-05 3.60514086e-05 3.57665496e-05 3.52932482e-05 3.34968465e-05 3.31235233e-05 3.21153257e-05 3.18556010e-05 3.02499566e-05 2.94357150e-05 2.89359296e-05 2.82141789e-05 2.73263473e-05 2.62064496e-05 2.56404826e-05 2.51750016e-05 2.49728130e-05 2.40750927e-05 2.35845123e-05 2.27150826e-05 2.23629972e-05 2.18625855e-05 2.10680237e-05 2.02133354e-05 2.00445152e-05 1.95049282e-05 1.88633283e-05 1.82315264e-05 1.79355118e-05 1.76784102e-05 1.64655844e-05 1.61209990e-05 1.58314464e-05 1.44673336e-05 1.42448085e-05 1.40693877e-05 1.38175715e-05 1.34807684e-05 1.30000516e-05 1.26471023e-05 1.21033549e-05 1.18192364e-05 1.16306301e-05 1.13912231e-05 1.06103298e-05 1.02651321e-05 9.86587802e-06 9.26418713e-06 9.08015628e-06 8.64719349e-06 8.53371100e-06 8.31532111e-06 7.83217961e-06 7.43691549e-06 7.27757710e-06 6.81048647e-06 6.60482732e-06 6.02340880e-06 5.87818728e-06 5.61736521e-06 5.36891518e-06 5.35160949e-06 4.96842112e-06 4.78553439e-06 4.48345734e-06 4.29093896e-06 4.21584309e-06 3.95500876e-06 3.75654262e-06 3.67693271e-06 3.56307340e-06 3.28124028e-06 3.08336358e-06 2.81599104e-06 2.69902689e-06 2.48816254e-06 2.27172130e-06 2.03283951e-06 1.90966124e-06 1.61863106e-06 1.30949679e-06 1.00656365e-31] Cumulative Explained Variance: [0.51038841 0.5364888 0.55196169 0.5651844 0.57710359 0.58819225 0.59812634 0.60767482 0.61660264 0.62525327 0.63317361 0.64073387 0.64815166 0.65523628 0.66201537 0.66865117 0.67484567 0.68082377 0.68675723 0.69247042 0.69800485 0.7034274 0.70869114 0.71377202 0.71871412 0.72336672 0.72791486 0.7324195 0.73679593 0.74107386 0.74530226 0.74945999 0.75359843 0.75757583 0.76151865 0.76537182 0.76919677 0.77292978 0.77659478 0.78020279 0.7837502 0.78726847 0.79072724 0.79415238 0.79751153 0.80081378 0.80406769 0.80731124 0.81053399 
0.81369597 0.81682289 0.81991471 0.82298406 0.82604103 0.82905301 0.83204404 0.83499189 0.83791432 0.84083206 0.8437176 0.84658592 0.84939955 0.8521723 0.85491731 0.85764614 0.8603275 0.86298531 0.86560726 0.86818136 0.87072982 0.87325649 0.87575929 0.87819228 0.88058588 0.88290282 0.88519729 0.88744065 0.88967082 0.89185091 0.8940129 0.8961359 0.89819361 0.90019456 0.90215831 0.90409308 0.90598961 0.90787861 0.90969799 0.91146827 0.91319272 0.9149056 0.9165737 0.918219 0.91981383 0.92139888 0.92296281 0.92450177 0.926023 0.92751701 0.92897583 0.93039262 0.93179295 0.93315638 0.93449858 0.93582835 0.93714791 0.9384304 0.9396731 0.94090756 0.94211147 0.9432938 0.94443796 0.94555817 0.94667424 0.94774501 0.94880127 0.94984855 0.95088044 0.95187484 0.95285191 0.95381789 0.95474263 0.95565239 0.95653453 0.95741536 0.95827633 0.95911731 0.95993804 0.96075 0.96155623 0.96233959 0.96311509 0.96386772 0.96460523 0.96532178 0.96603246 0.96673646 0.9674231 0.96809644 0.96874869 0.96938693 0.97001875 0.97062679 0.97122825 0.97180346 0.97237286 0.97293466 0.97348851 0.97403037 0.97456035 0.97507853 0.97558681 0.97609 0.97658574 0.97706922 0.97753585 0.97799686 0.97845208 0.97890064 0.97933653 0.97976938 0.98019754 0.98062293 0.9810326 0.98142731 0.98181719 0.98219796 0.98257379 0.98294655 0.98331426 0.9836718 0.98401991 0.98436034 0.98469527 0.98502993 0.98535677 0.98567435 0.98598781 0.98629673 0.98659652 0.98688804 0.98717564 0.98745989 0.98773764 0.98800576 0.98826865 0.98852741 0.98877924 0.98902756 0.98927334 0.98950867 0.98973907 0.98996709 0.99019172 0.9904126 0.99063033 0.99083837 0.99104465 0.99124535 0.99144196 0.99163579 0.99182213 0.9920065 0.99218521 0.99235965 0.99253024 0.9926969 0.99286203 0.99302566 0.9931862 0.99334496 0.99350051 0.99365364 0.99380413 0.9939507 0.99409108 0.99423019 0.99436528 0.99449938 0.99463143 0.99475999 0.99488693 0.99501195 0.99513343 0.99525407 0.99537181 0.99548579 0.99559792 0.99570858 0.9958172 0.99592258 0.99602621 0.9961278 
0.99622787 0.99632547 0.99642037 0.99651196 0.99660223 0.99669046 0.99677746 0.99686284 0.9969479 0.99702864 0.99710796 0.99718521 0.99726033 0.99733416 0.99740703 0.99747789 0.99754848 0.997618 0.99768513 0.9977498 0.99781301 0.99787605 0.99793749 0.99799685 0.99805494 0.99811228 0.99816878 0.99822348 0.99827659 0.99832927 0.99837977 0.99842956 0.99847823 0.99852535 0.99857094 0.99861622 0.99866002 0.99870135 0.99874211 0.99878137 0.99881998 0.99885692 0.99889297 0.99892874 0.99896403 0.99899753 0.99903065 0.99906277 0.99909462 0.99912487 0.99915431 0.99918325 0.99921146 0.99923879 0.99926499 0.99929063 0.99931581 0.99934078 0.99936486 0.99938844 0.99941116 0.99943352 0.99945538 0.99947645 0.99949666 0.99951671 0.99953621 0.99955508 0.99957331 0.99959124 0.99960892 0.99962539 0.99964151 0.99965734 0.99967181 0.99968605 0.99970012 0.99971394 0.99972742 0.99974042 0.99975307 0.99976517 0.99977699 0.99978862 0.99980001 0.99981062 0.99982089 0.99983075 0.99984002 0.9998491 0.99985774 0.99986628 0.99987459 0.99988242 0.99988986 0.99989714 0.99990395 0.99991055 0.99991658 0.99992246 0.99992807 0.99993344 0.99993879 0.99994376 0.99994855 0.99995303 0.99995732 0.99996154 0.99996549 0.99996925 0.99997293 0.99997649 0.99997977 0.99998285 0.99998567 0.99998837 0.99999086 0.99999313 0.99999516 0.99999707 0.99999869 1. 1. ]
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def plot_cumulative_variance(df, df_name, threshold=0.99):
    X = df.drop('Accident Level', axis=1)
    # Scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # PCA
    pca = PCA()
    pca.fit(X_scaled)
    # Explained variance ratio and cumulative explained variance
    explained_variance_ratio = pca.explained_variance_ratio_
    cumulative_explained_variance = np.cumsum(explained_variance_ratio)
    # Find the number of components needed to reach the threshold
    n_components_at_threshold = np.argmax(cumulative_explained_variance >= threshold) + 1
    # Plotting
    plt.figure(figsize=(10, 5))
    plt.plot(np.arange(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance)
    plt.axhline(y=threshold, color='g', linestyle='--')
    plt.text(n_components_at_threshold, threshold, f"{n_components_at_threshold}", color='green')
    plt.title(f'Cumulative Explained Variance vs. Principal Components ({df_name})')
    plt.xlabel('Number of Principal Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.grid(True)
    plt.show()

# Plot for each dataframe (each call draws its own figure, so no
# inter-plot spacing adjustment is needed; stray subplots_adjust calls
# between figures only spawn empty figures)
plot_cumulative_variance(Final_NLP_Glove_df, 'Glove Embeddings')
plot_cumulative_variance(Final_NLP_TFIDF_df, 'TF-IDF Features')
plot_cumulative_variance(Final_NLP_Word2Vec_df, 'Word2Vec Embeddings')
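The plots only visualize where the variance threshold is crossed; the companion step is projecting the features down to that many components. A minimal sketch on synthetic data (the real `Final_NLP_*_df` frames are built earlier in the notebook) — passing a float in (0, 1) as `n_components` asks scikit-learn's `PCA` for the smallest number of components whose cumulative explained variance reaches that fraction:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for one of the feature matrices
rng = np.random.RandomState(42)
X = rng.rand(200, 30)
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 99% of the variance
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X_scaled)
```

`X_reduced` then feeds the downstream classifiers in place of the full feature matrix.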
# Train and evaluate classifiers on the PCA-transformed features
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import time

# Initialize classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XG Boost": XGBClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# Function to train and evaluate models on PCA-transformed data
def train_and_evaluate_pca(X_train, X_test, y_train, y_test):
    results = []
    for name, clf in classifiers.items():
        start_time = time.time()
        clf.fit(X_train, y_train)
        training_time = time.time() - start_time
        # Train metrics
        y_train_pred = clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        start_time = time.time()
        y_test_pred = clf.predict(X_test)
        prediction_time = time.time() - start_time
        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')
        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time])
    return results

# Train and evaluate on each PCA-transformed dataset
glove_results_pca = train_and_evaluate_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove)
tfidf_results_pca = train_and_evaluate_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf)
word2vec_results_pca = train_and_evaluate_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec)

# Create DataFrames for the results
columns = ['Classifier',
           'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
           'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
           'Training Time', 'Prediction Time']
glove_df_pca = pd.DataFrame(glove_results_pca, columns=columns)
tfidf_df_pca = pd.DataFrame(tfidf_results_pca, columns=columns)
word2vec_df_pca = pd.DataFrame(word2vec_results_pca, columns=columns)

print("Classification matrix for Glove (PCA)")
glove_df_pca
Classification matrix for Glove (PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.948220 | 0.948514 | 0.948220 | 0.948278 | 0.105575 | 0.000378 |
| 1 | Support Vector Machine | 0.993528 | 0.993549 | 0.993528 | 0.993527 | 0.967638 | 0.970904 | 0.967638 | 0.968258 | 0.167613 | 0.056506 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.776699 | 0.777464 | 0.776699 | 0.774070 | 0.329336 | 0.000332 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.954693 | 0.959536 | 0.954693 | 0.955472 | 1.775282 | 0.011454 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.980583 | 0.982138 | 0.980583 | 0.980819 | 53.698463 | 0.004048 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.973752 | 0.970874 | 0.971284 | 1.436882 | 0.003529 |
| 6 | Naive Bayes | 0.907767 | 0.909527 | 0.907767 | 0.906243 | 0.834951 | 0.845118 | 0.834951 | 0.835399 | 0.003593 | 0.001635 |
| 7 | K-Nearest Neighbors | 0.842233 | 0.873798 | 0.842233 | 0.806267 | 0.877023 | 0.900970 | 0.877023 | 0.844946 | 0.000774 | 0.003505 |
print("\nClassification matrix for TFIDF (PCA)")
tfidf_df_pca
Classification matrix for TFIDF (PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.998382 | 0.998385 | 0.998382 | 0.998382 | 0.957929 | 0.959284 | 0.957929 | 0.956248 | 0.119481 | 0.000438 |
| 1 | Support Vector Machine | 0.987864 | 0.988173 | 0.987864 | 0.987872 | 0.993528 | 0.993700 | 0.993528 | 0.993541 | 0.188764 | 0.080477 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.906149 | 0.903889 | 0.906149 | 0.903676 | 0.600765 | 0.000409 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.983819 | 0.984182 | 0.983819 | 0.983635 | 2.135473 | 0.011151 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.983819 | 0.984328 | 0.983819 | 0.983830 | 97.498487 | 0.003880 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.977346 | 0.977723 | 0.977346 | 0.977362 | 4.031977 | 0.001467 |
| 6 | Naive Bayes | 0.789644 | 0.819981 | 0.789644 | 0.779825 | 0.786408 | 0.810530 | 0.786408 | 0.784380 | 0.005635 | 0.002616 |
| 7 | K-Nearest Neighbors | 0.851133 | 0.908750 | 0.851133 | 0.830634 | 0.851133 | 0.907750 | 0.851133 | 0.801892 | 0.000835 | 0.004656 |
print("\nClassification matrix for Word2Vec (PCA)")
word2vec_df_pca
Classification matrix for Word2Vec (PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.941748 | 0.943847 | 0.941748 | 0.942467 | 0.095108 | 0.000333 |
| 1 | Support Vector Machine | 0.987864 | 0.987891 | 0.987864 | 0.987851 | 0.964401 | 0.969988 | 0.964401 | 0.965378 | 0.140915 | 0.054962 |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.805825 | 0.807284 | 0.805825 | 0.805828 | 0.321802 | 0.000337 |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.948220 | 0.956001 | 0.948220 | 0.949430 | 1.656468 | 0.011492 |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.972674 | 0.970874 | 0.971157 | 48.383260 | 0.003917 |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.967638 | 0.969718 | 0.967638 | 0.967933 | 1.549905 | 0.001610 |
| 6 | Naive Bayes | 0.907767 | 0.911257 | 0.907767 | 0.907027 | 0.844660 | 0.856784 | 0.844660 | 0.844113 | 0.003453 | 0.001562 |
| 7 | K-Nearest Neighbors | 0.869741 | 0.887881 | 0.869741 | 0.848454 | 0.867314 | 0.884397 | 0.867314 | 0.833406 | 0.000745 | 0.002465 |
GloVe Embedding with PCA:
TFIDF Features with PCA:
Word2Vec Embedding with PCA:
Insights and Comparison:
# Function to plot classification-report heatmaps and training/prediction times
def plot_results(df, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    # Classification report heatmap
    report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
                      'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
    sns.heatmap(report_data, annot=True, cmap='Purples', fmt='.2f', ax=ax1)
    ax1.set_title(f'Classifier Performance - {title}')
    # Training and prediction time comparison
    df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
    ax2.set_title(f'Training and Prediction Time - {title}')
    ax2.set_ylabel('Time (seconds)')
    plt.tight_layout()
    plt.show()

# Plot results for each DataFrame (with PCA)
plot_results(glove_df_pca, 'Glove Embeddings (PCA)')
plot_results(tfidf_df_pca, 'TF-IDF Features (PCA)')
plot_results(word2vec_df_pca, 'Word2Vec Embeddings (PCA)')
# Plot confusion matrices for all classifiers on the Glove, TF-IDF and Word2Vec features (with PCA)
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices_pca(X_train, X_test, y_train, y_test, df_name):
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    fig.suptitle(f'Confusion Matrices for {df_name} (PCA)', fontsize=16)
    for i, (name, clf) in enumerate(classifiers.items()):
        row = i // 4
        col = i % 4
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        cm = confusion_matrix(y_test, y_pred)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
        disp.plot(ax=axes[row, col], cmap='Purples')
        axes[row, col].set_title(name)
    plt.tight_layout()
    plt.show()

plot_confusion_matrices_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_confusion_matrices_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_confusion_matrices_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
Overall Performance:
Glove Embeddings with PCA:
TF-IDF Features with PCA:
Word2Vec Embeddings with PCA:
Class-specific observations:
Model Complexity:
Embedding Effectiveness with PCA:
Conclusion
Train vs Test Confusion Matrices for all ML classifiers with PCA
def plot_train_test_confusion_matrices_pca(X_train, X_test, y_train, y_test, df_name):
    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name} (PCA)', fontsize=15, y=0.98)
    for i, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)
        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Purples')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)
        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Purples')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
plot_train_test_confusion_matrices_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_train_test_confusion_matrices_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_train_test_confusion_matrices_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
# Apply hyperparameter tuning to all the classifiers and run without PCA
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
import time

# Prepare data
X_glove = Final_NLP_Glove_df.drop('Accident Level', axis=1)
y_glove = Final_NLP_Glove_df['Accident Level']
X_tfidf = Final_NLP_TFIDF_df.drop('Accident Level', axis=1)
y_tfidf = Final_NLP_TFIDF_df['Accident Level']
X_word2vec = Final_NLP_Word2Vec_df.drop('Accident Level', axis=1)
y_word2vec = Final_NLP_Word2Vec_df['Accident Level']

# Split data
X_train_glove, X_test_glove, y_train_glove, y_test_glove = train_test_split(X_glove, y_glove, test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, y_tfidf, test_size=0.2, random_state=42)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = train_test_split(X_word2vec, y_word2vec, test_size=0.2, random_state=42)

# Define classifiers and their hyperparameter grids
classifiers = {
    "Logistic Regression": (LogisticRegression(), {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear', 'saga'],
        'max_iter': [100, 500, 1000]
    }),
    "Support Vector Machine": (SVC(), {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf', 'poly'],
        'gamma': ['scale', 'auto'],
        'class_weight': ['balanced', None],
        'max_iter': [1000, 5000, 10000]
    }),
    "Decision Tree": (DecisionTreeClassifier(), {
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 5, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }),
    "Random Forest": (RandomForestClassifier(), {
        'n_estimators': [50, 100, 200],
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['auto', 'sqrt']  # note: 'auto' was removed in scikit-learn 1.3; use 'sqrt'/'log2' on newer versions
    }),
    "Gradient Boosting": (GradientBoostingClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'n_iter_no_change': [5],
        'validation_fraction': [0.1, 0.2]
    }),
    "XG Boost": (XGBClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7],
        'subsample': [0.8, 0.9, 1.0],
        'colsample_bytree': [0.8, 0.9, 1.0]
    }),
    "Naive Bayes": (GaussianNB(), {}),  # GaussianNB is left at its defaults (only var_smoothing is tunable)
    "K-Nearest Neighbors": (KNeighborsClassifier(), {
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]
    })
}

# Scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}

# Function to perform hyperparameter tuning and evaluation
def tune_and_evaluate(X_train, X_test, y_train, y_test, embedding_name):
    results = []
    for name, (clf, param_grid) in classifiers.items():
        start_time = time.time()
        # Use RandomizedSearchCV for efficiency with large parameter grids
        grid_search = RandomizedSearchCV(clf, param_grid, cv=5, scoring=scoring, refit='f1', n_jobs=-1, verbose=2, random_state=42)
        grid_search.fit(X_train, y_train)
        training_time = time.time() - start_time
        best_clf = grid_search.best_estimator_
        # Train metrics (using the best estimator)
        y_train_pred = best_clf.predict(X_train)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        start_time = time.time()
        y_test_pred = best_clf.predict(X_test)
        prediction_time = time.time() - start_time
        # Test metrics
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_precision = precision_score(y_test, y_test_pred, average='weighted')
        test_recall = recall_score(y_test, y_test_pred, average='weighted')
        test_f1 = f1_score(y_test, y_test_pred, average='weighted')
        results.append([name,
                        train_accuracy, train_precision, train_recall, train_f1,
                        test_accuracy, test_precision, test_recall, test_f1,
                        training_time, prediction_time, grid_search.best_params_])
    # Create a DataFrame and print the results
    columns = ['Classifier',
               'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
               'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
               'Training Time', 'Prediction Time', 'Best Parameters']
    df = pd.DataFrame(results, columns=columns)
    print(f"----- Results for {embedding_name} -----")
    print(df)
    return df

# Tune and evaluate for each embedding
glove_results = tune_and_evaluate(X_train_glove, X_test_glove, y_train_glove, y_test_glove, "Glove")
tfidf_results = tune_and_evaluate(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, "TF-IDF")
word2vec_results = tune_and_evaluate(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, "Word2Vec")
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for Glove -----
Classifier Train Accuracy Train Precision Train Recall \
0 Logistic Regression 0.997573 0.997576 0.997573
1 Support Vector Machine 0.997573 0.997583 0.997573
2 Decision Tree 0.999191 0.999194 0.999191
3 Random Forest 0.999191 0.999194 0.999191
4 Gradient Boosting 0.994337 0.994340 0.994337
5 XG Boost 0.999191 0.999194 0.999191
6 Naive Bayes 0.576052 0.686802 0.576052
7 K-Nearest Neighbors 0.999191 0.999194 0.999191
Train F1-score Test Accuracy Test Precision Test Recall Test F1-score \
0 0.997573 0.941748 0.941760 0.941748 0.941599
1 0.997573 0.964401 0.969511 0.964401 0.965489
2 0.999191 0.857605 0.853864 0.857605 0.855199
3 0.999191 0.987055 0.987190 0.987055 0.987094
4 0.994327 0.974110 0.974303 0.974110 0.973654
5 0.999191 0.977346 0.977750 0.977346 0.977273
6 0.555990 0.576052 0.619135 0.576052 0.560298
7 0.999191 0.873786 0.895224 0.873786 0.838124
Training Time Prediction Time \
0 11.900393 0.005751
1 2.059348 0.069366
2 2.920077 0.003244
3 11.566955 0.024556
4 304.612072 0.005964
5 115.302313 0.071594
6 0.167307 0.005420
7 0.701028 0.043943
Best Parameters
0 {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 500, 'C': 10}
1 {'max_iter': 10000, 'kernel': 'rbf', 'gamma': 'scale', 'class_weight': 'balanced', 'C': 10}
2 {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'criterion': 'gini'}
3 {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20, 'criterion': 'gini'}
4 {'validation_fraction': 0.1, 'n_iter_no_change': 5, 'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 5, 'learning_rate': 0.2}
5 {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 0.9}
6 {}
7 {'weights': 'distance', 'p': 1, 'n_neighbors': 3}
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for TF-IDF -----
Classifier Train Accuracy Train Precision Train Recall \
0 Logistic Regression 0.998382 0.998385 0.998382
1 Support Vector Machine 0.997573 0.997579 0.997573
2 Decision Tree 0.999191 0.999194 0.999191
3 Random Forest 0.999191 0.999194 0.999191
4 Gradient Boosting 0.995146 0.995153 0.995146
5 XG Boost 0.999191 0.999194 0.999191
6 Naive Bayes 0.999191 0.999194 0.999191
7 K-Nearest Neighbors 0.956311 0.959765 0.956311
Train F1-score Test Accuracy Test Precision Test Recall Test F1-score \
0 0.998382 0.970874 0.973348 0.970874 0.971411
1 0.997571 0.987055 0.987362 0.987055 0.987085
2 0.999191 0.860841 0.865059 0.860841 0.862333
3 0.999191 0.961165 0.967778 0.961165 0.962261
4 0.995138 0.925566 0.935225 0.925566 0.928043
5 0.999191 0.944984 0.954156 0.944984 0.946625
6 0.999191 0.970874 0.973284 0.970874 0.971391
7 0.954807 0.938511 0.945552 0.938511 0.933507
Training Time Prediction Time \
0 60.541348 0.062399
1 19.059741 0.238296
2 1.743609 0.016801
3 3.144647 0.029608
4 280.731220 0.021044
5 105.556405 0.520020
6 0.355191 0.032632
7 4.036807 0.356221
Best Parameters
0 {'solver': 'saga', 'penalty': 'l2', 'max_iter': 500, 'C': 10}
1 {'max_iter': 10000, 'kernel': 'linear', 'gamma': 'auto', 'class_weight': 'balanced', 'C': 10}
2 {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'criterion': 'gini'}
3 {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'entropy'}
4 {'validation_fraction': 0.1, 'n_iter_no_change': 5, 'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 3, 'learning_rate': 0.2}
5 {'subsample': 0.9, 'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2, 'colsample_bytree': 1.0}
6 {}
7 {'weights': 'uniform', 'p': 1, 'n_neighbors': 3}
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----- Results for Word2Vec -----
(Results DataFrame printed with wrapped columns; the complete results appear in the formatted Word2Vec table below.)
print("Glove Results")
display(glove_results)
Glove Results
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.997573 | 0.997576 | 0.997573 | 0.997573 | 0.941748 | 0.941760 | 0.941748 | 0.941599 | 11.900393 | 0.005751 | {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 500, 'C': 10} |
| 1 | Support Vector Machine | 0.997573 | 0.997583 | 0.997573 | 0.997573 | 0.964401 | 0.969511 | 0.964401 | 0.965489 | 2.059348 | 0.069366 | {'max_iter': 10000, 'kernel': 'rbf', 'gamma': 'scale', 'class_weight': 'balanced', 'C': 10} |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.857605 | 0.853864 | 0.857605 | 0.855199 | 2.920077 | 0.003244 | {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'criterion': 'gini'} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.987055 | 0.987190 | 0.987055 | 0.987094 | 11.566955 | 0.024556 | {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20, 'criterion': 'gini'} |
| 4 | Gradient Boosting | 0.994337 | 0.994340 | 0.994337 | 0.994327 | 0.974110 | 0.974303 | 0.974110 | 0.973654 | 304.612072 | 0.005964 | {'validation_fraction': 0.1, 'n_iter_no_change': 5, 'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 5, 'learning_rate': 0.2} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.977346 | 0.977750 | 0.977346 | 0.977273 | 115.302313 | 0.071594 | {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 0.9} |
| 6 | Naive Bayes | 0.576052 | 0.686802 | 0.576052 | 0.555990 | 0.576052 | 0.619135 | 0.576052 | 0.560298 | 0.167307 | 0.005420 | {} |
| 7 | K-Nearest Neighbors | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.873786 | 0.895224 | 0.873786 | 0.838124 | 0.701028 | 0.043943 | {'weights': 'distance', 'p': 1, 'n_neighbors': 3} |
print("TF-IDF Results")
display(tfidf_results)
TF-IDF Results
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.998382 | 0.998385 | 0.998382 | 0.998382 | 0.970874 | 0.973348 | 0.970874 | 0.971411 | 60.541348 | 0.062399 | {'solver': 'saga', 'penalty': 'l2', 'max_iter': 500, 'C': 10} |
| 1 | Support Vector Machine | 0.997573 | 0.997579 | 0.997573 | 0.997571 | 0.987055 | 0.987362 | 0.987055 | 0.987085 | 19.059741 | 0.238296 | {'max_iter': 10000, 'kernel': 'linear', 'gamma': 'auto', 'class_weight': 'balanced', 'C': 10} |
| 2 | Decision Tree | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.860841 | 0.865059 | 0.860841 | 0.862333 | 1.743609 | 0.016801 | {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'criterion': 'gini'} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.961165 | 0.967778 | 0.961165 | 0.962261 | 3.144647 | 0.029608 | {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'entropy'} |
| 4 | Gradient Boosting | 0.995146 | 0.995153 | 0.995146 | 0.995138 | 0.925566 | 0.935225 | 0.925566 | 0.928043 | 280.731220 | 0.021044 | {'validation_fraction': 0.1, 'n_iter_no_change': 5, 'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 3, 'learning_rate': 0.2} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.944984 | 0.954156 | 0.944984 | 0.946625 | 105.556405 | 0.520020 | {'subsample': 0.9, 'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2, 'colsample_bytree': 1.0} |
| 6 | Naive Bayes | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.973284 | 0.970874 | 0.971391 | 0.355191 | 0.032632 | {} |
| 7 | K-Nearest Neighbors | 0.956311 | 0.959765 | 0.956311 | 0.954807 | 0.938511 | 0.945552 | 0.938511 | 0.933507 | 4.036807 | 0.356221 | {'weights': 'uniform', 'p': 1, 'n_neighbors': 3} |
print("Word2Vec Results")
display(word2vec_results)
Word2Vec Results
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.725728 | 0.726169 | 0.725728 | 0.721778 | 0.692557 | 0.695657 | 0.692557 | 0.680702 | 9.581789 | 0.005095 | {'solver': 'saga', 'penalty': 'l2', 'max_iter': 500, 'C': 10} |
| 1 | Support Vector Machine | 0.807443 | 0.808478 | 0.807443 | 0.805625 | 0.728155 | 0.733332 | 0.728155 | 0.724260 | 2.183543 | 0.082855 | {'max_iter': 10000, 'kernel': 'rbf', 'gamma': 'scale', 'class_weight': None, 'C': 10} |
| 2 | Decision Tree | 0.998382 | 0.998388 | 0.998382 | 0.998382 | 0.828479 | 0.829496 | 0.828479 | 0.828308 | 3.143800 | 0.003093 | {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10, 'criterion': 'entropy'} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.983819 | 0.983763 | 0.983819 | 0.983768 | 12.009001 | 0.026426 | {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20, 'criterion': 'gini'} |
| 4 | Gradient Boosting | 0.998382 | 0.998385 | 0.998382 | 0.998382 | 0.967638 | 0.967603 | 0.967638 | 0.967239 | 319.892342 | 0.006273 | {'validation_fraction': 0.1, 'n_iter_no_change': 5, 'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 5, 'learning_rate': 0.2} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.961165 | 0.962123 | 0.961165 | 0.959751 | 113.411497 | 0.365394 | {'subsample': 0.8, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 0.9} |
| 6 | Naive Bayes | 0.529935 | 0.593576 | 0.529935 | 0.513007 | 0.537217 | 0.579248 | 0.537217 | 0.527450 | 0.166228 | 0.005396 | {} |
| 7 | K-Nearest Neighbors | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.841424 | 0.835505 | 0.841424 | 0.832014 | 0.764167 | 0.043715 | {'weights': 'distance', 'p': 1, 'n_neighbors': 3} |
GloVe Embedding with Hypertuning:
TFIDF Features with Hypertuning:
Word2Vec Embedding with Hypertuning:
Insights and Comparison:
This comparison underscores the importance of hyperparameter tuning for improving model performance and generalization, particularly for the more complex ensemble models and for dense embeddings such as Word2Vec, whose linear classifiers lag well behind their TF-IDF and GloVe counterparts.
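To make the comparison concrete, the best hyper-tuned test F1-scores can be gathered into a small summary frame. The numbers and winning classifiers below are copied from the GloVe, TF-IDF, and Word2Vec result tables above; this is a quick illustrative sketch, not part of the evaluation pipeline:

```python
import pandas as pd

# Best hyper-tuned (no PCA) test F1-scores, copied from the result tables above
summary = pd.DataFrame({
    "Embedding": ["GloVe", "TF-IDF", "Word2Vec"],
    "Best Classifier": ["Random Forest", "Support Vector Machine", "Random Forest"],
    "Test F1-score": [0.987094, 0.987085, 0.983768],
})
best = summary.loc[summary["Test F1-score"].idxmax()]
print(f"Best combination: {best['Best Classifier']} on {best['Embedding']} "
      f"(F1 = {best['Test F1-score']:.4f})")
```

GloVe + Random Forest edges out TF-IDF + SVM by a very thin margin here, so the ranking should be read as "effectively tied" rather than a decisive win.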
# Function to plot classification reports for all the ML classifiers (with Hypertuning) and training/prediction times
def plot_results(df, title):
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Classification report heatmap
report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
sns.heatmap(report_data, annot=True, cmap='Blues', fmt='.2f', ax=ax1)
ax1.set_title(f'Classifier Performance - {title}')
# Training and prediction time comparison
df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
ax2.set_title(f'Training and Prediction Time - {title}')
ax2.set_ylabel('Time (seconds)')
plt.tight_layout()
plt.show()
# Plot results for each DataFrame (with hyperparameter tuning)
plot_results(glove_results, 'Glove Embeddings (Hyperparameter Tuning)')
plot_results(tfidf_results, 'TF-IDF Embeddings (Hyperparameter Tuning)')
plot_results(word2vec_results, 'Word2Vec Embeddings (Hyperparameter Tuning)')
# Function to plot confusion matrices for all classifiers with word embeddings generated using Glove, TF-IDF and Word2Vec, along with Hypertuning, without PCA
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
def plot_train_test_confusion_matrices_ht_no_pca(X_train, X_test, y_train, y_test, df_name):
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
fig.suptitle(f'Confusion Matrices for {df_name} (No PCA)', fontsize=16)
for i, (name, (clf, _)) in enumerate(classifiers.items()):
row = i // 4
col = i % 4
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(ax=axes[row, col], cmap='Blues')
axes[row, col].set_title(name)
plt.tight_layout()
plt.show()
plot_train_test_confusion_matrices_ht_no_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_train_test_confusion_matrices_ht_no_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_train_test_confusion_matrices_ht_no_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
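When reading these matrices, per-class recall is the diagonal divided by each row sum, since rows hold the true classes. A minimal sketch of that calculation (the accident-level labels and counts here are illustrative toy values, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only -- the real plots use the fitted classifiers' classes_
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "II", "II", "II", "III", "I"]
cm = confusion_matrix(y_true, y_pred, labels=["I", "II", "III"])
# Rows are true classes, columns are predictions; the diagonal holds correct counts
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)  # class-wise recall for levels I, II, III
```

This row-wise view is what makes the matrices useful for spotting which accident levels a classifier systematically misses, even when overall accuracy looks high.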
Overall Observations:
Glove Embeddings with Hypertuning:
TF-IDF Features with Hypertuning:
Word2Vec Embeddings with Hypertuning:
Comparison with Non-Hyperparameter Tuned Models:
Train vs Test Confusion Matrices for all ML classifiers with Hypertuning
def plot_train_test_confusion_matrices_ht(X_train, X_test, y_train, y_test, df_name):
fig, axes = plt.subplots(8, 2, figsize=(20, 40))
fig.suptitle(f'Train and Test Confusion Matrices for {df_name} (Hyperparameter Tuning)', fontsize=15, y=0.98)
for i, (name, (clf, _)) in enumerate(classifiers.items()):
clf.fit(X_train, y_train)
# Train confusion matrix
y_train_pred = clf.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
disp_train.plot(ax=axes[i, 0], cmap='Blues')
axes[i, 0].set_title(f'{name} (Train)', fontsize=12)
# Test confusion matrix
y_test_pred = clf.predict(X_test)
cm_test = confusion_matrix(y_test, y_test_pred)
disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
disp_test.plot(ax=axes[i, 1], cmap='Blues')
axes[i, 1].set_title(f'{name} (Test)', fontsize=12)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
plot_train_test_confusion_matrices_ht(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_train_test_confusion_matrices_ht(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_train_test_confusion_matrices_ht(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
# Evaluating the performance of all the classifiers using Hypertuning and PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, make_scorer
from sklearn.metrics import ConfusionMatrixDisplay
import time
import pandas as pd
import matplotlib.pyplot as plt
# Assuming 'Final_NLP_Glove_df', 'Final_NLP_TFIDF_df', and 'Final_NLP_Word2Vec_df' are already defined
def apply_pca_and_split(df, n_components=0.99):
X = df.drop('Accident Level', axis=1)
y = df['Accident Level']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
if n_components < 1:
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)
else:
X_pca = X_scaled
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
return X_train, X_test, y_train, y_test
# Apply PCA and split for each dataframe
X_train_glove, X_test_glove, y_train_glove, y_test_glove = apply_pca_and_split(Final_NLP_Glove_df)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = apply_pca_and_split(Final_NLP_TFIDF_df)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = apply_pca_and_split(Final_NLP_Word2Vec_df)
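`apply_pca_and_split` passes `n_components=0.99`, which scikit-learn interprets as "keep the smallest number of components that explain at least 99% of the variance" rather than a fixed component count. A standalone sketch of that behaviour on synthetic data (shapes and values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.randn(200, 50)
X[:, 1] = 2 * X[:, 0] + 0.01 * rng.randn(200)  # make one feature nearly redundant

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.99)  # a fraction in (0, 1) means "variance to retain"
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape[1], round(pca.explained_variance_ratio_.sum(), 4))
```

Because one feature is almost a copy of another, PCA drops at least one dimension while still retaining 99% of the variance; on the dense embedding frames the reduction is what keeps the grid searches tractable.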
# Define classifiers and hyperparameter grids
classifiers = {
"Logistic Regression": (LogisticRegression(), {
'penalty': ['l1', 'l2'],
'C': [0.01, 0.1, 1, 10],
'solver': ['liblinear', 'saga'],
'max_iter': [100, 500, 1000]
}),
"Support Vector Machine": (SVC(), {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf', 'poly'],
'gamma': ['scale', 'auto'],
'class_weight': ['balanced']
}),
"Decision Tree": (DecisionTreeClassifier(), {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 5, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}),
"Random Forest": (RandomForestClassifier(), {
'n_estimators': [100, 200],
'criterion': ['gini', 'entropy'],
'max_depth': [10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['auto', 'sqrt']
}),
"Gradient Boosting": (GradientBoostingClassifier(), {
'n_estimators': [200],
'learning_rate': [0.2],
'max_depth': [3, 5, 7],
'min_samples_split': [10],
'min_samples_leaf': [4]
}),
"XG Boost": (XGBClassifier(), {
'n_estimators': [100],
'learning_rate': [0.2],
'max_depth': [3, 5, 7],
'subsample': [0.9]
}),
"Naive Bayes": (GaussianNB(), {}),
"K-Nearest Neighbors": (KNeighborsClassifier(), {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'p': [1, 2]
})
}
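The candidate counts GridSearchCV reports below are simply the product of the value counts in each grid. For the Random Forest grid above, for instance:

```python
from math import prod

# Same grid as the Random Forest entry in `classifiers` above
rf_grid = {
    'n_estimators': [100, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt'],
}
n_candidates = prod(len(v) for v in rf_grid.values())
print(n_candidates)  # 2*2*2*3*3*2 = 144, matching "144 candidates" in the logs
```

With 5-fold CV that means 720 Random Forest fits per embedding, which is why the ensemble models dominate the training-time columns.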
scoring = {
'accuracy': make_scorer(accuracy_score),
'precision': make_scorer(precision_score, average='weighted'),
'recall': make_scorer(recall_score, average='weighted'),
'f1': make_scorer(f1_score, average='weighted')
}
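The `average='weighted'` setting used in all of these scorers averages per-class scores weighted by class support, which matters because the accident levels are imbalanced. A self-contained check of the definition on toy labels (the label values are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 1, 1, 2])  # imbalanced toy labels
y_pred = np.array([0, 0, 1, 1, 1, 2])
per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true) / len(y_true)         # class frequencies
weighted = (per_class * support).sum()
print(np.isclose(weighted, f1_score(y_true, y_pred, average='weighted')))
```

Because majority classes dominate the weights, a high weighted F1 can still hide weak performance on the rare, most severe accident levels; the confusion matrices later in the notebook are the complementary view.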
def tune_and_evaluate_pca(X_train, X_test, y_train, y_test, embedding_name):
results = []
for name, (clf, param_grid) in classifiers.items():
start_time = time.time()
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring=scoring, refit='f1', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
training_time = time.time() - start_time
best_clf = grid_search.best_estimator_
y_train_pred = best_clf.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred, average='weighted')
train_recall = recall_score(y_train, y_train_pred, average='weighted')
train_f1 = f1_score(y_train, y_train_pred, average='weighted')
start_time = time.time()
y_test_pred = best_clf.predict(X_test)
prediction_time = time.time() - start_time
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred, average='weighted')
test_recall = recall_score(y_test, y_test_pred, average='weighted')
test_f1 = f1_score(y_test, y_test_pred, average='weighted')
results.append([name,
train_accuracy, train_precision, train_recall, train_f1,
test_accuracy, test_precision, test_recall, test_f1,
training_time, prediction_time, grid_search.best_params_])
columns = ['Classifier',
'Train Accuracy', 'Train Precision', 'Train Recall', 'Train F1-score',
'Test Accuracy', 'Test Precision', 'Test Recall', 'Test F1-score',
'Training Time', 'Prediction Time', 'Best Parameters']
df = pd.DataFrame(results, columns=columns)
print(f"----- Results for {embedding_name} (with Hypertuning & PCA) -----")
print(df)
return df
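With a dict of scorers, GridSearchCV needs `refit` to name the single metric that selects the final model; `tune_and_evaluate_pca` refits on weighted F1. A toy-scale version of the same pattern (synthetic data, illustrative grid):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
scoring = {'accuracy': make_scorer(accuracy_score),
           'f1': make_scorer(f1_score, average='weighted')}
gs = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]},
                  cv=5, scoring=scoring, refit='f1')
gs.fit(X, y)
print(gs.best_params_)           # chosen by mean cross-validated weighted F1
print(round(gs.best_score_, 3))  # best_score_ reports the refit metric only
```

Note that `best_score_` refers to the `refit` metric; the other metrics remain available per candidate in `gs.cv_results_`.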
glove_results_ht_pca = tune_and_evaluate_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, "Glove")
tfidf_results_ht_pca = tune_and_evaluate_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, "TF-IDF")
word2vec_results_ht_pca = tune_and_evaluate_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, "Word2Vec")
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
----- Results for Glove (with Hypertuning & PCA) -----
(Results DataFrame printed with wrapped columns; the complete results appear in the formatted Glove table below.)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
----- Results for TF-IDF (with Hypertuning & PCA) -----
(Results DataFrame printed with wrapped columns; the complete results appear in the formatted TF-IDF table below.)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
----- Results for Word2Vec (with Hypertuning & PCA) -----
(Results DataFrame printed with wrapped columns; the complete results appear in the formatted Word2Vec table below.)
print("Glove Results (with Hypertuning & PCA)")
display(glove_results_ht_pca)
Glove Results (with Hypertuning & PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.998382 | 0.998385 | 0.998382 | 0.998382 | 0.954693 | 0.954891 | 0.954693 | 0.954703 | 77.103474 | 0.000402 | {'C': 0.1, 'max_iter': 500, 'penalty': 'l2', 'solver': 'saga'} |
| 1 | Support Vector Machine | 0.996764 | 0.996777 | 0.996764 | 0.996762 | 0.964401 | 0.968638 | 0.964401 | 0.965181 | 2.277558 | 0.057942 | {'C': 10, 'class_weight': 'balanced', 'gamma': 'auto', 'kernel': 'rbf'} |
| 2 | Decision Tree | 0.988673 | 0.988722 | 0.988673 | 0.988671 | 0.802589 | 0.816676 | 0.802589 | 0.807753 | 14.266272 | 0.000420 | {'criterion': 'entropy', 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.961165 | 0.965057 | 0.961165 | 0.961746 | 111.870976 | 0.011469 | {'criterion': 'entropy', 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100} |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.972769 | 0.970874 | 0.971221 | 273.573186 | 0.006782 | {'learning_rate': 0.2, 'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.973752 | 0.970874 | 0.971284 | 16.019643 | 0.004797 | {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.9} |
| 6 | Naive Bayes | 0.907767 | 0.909527 | 0.907767 | 0.906243 | 0.834951 | 0.845118 | 0.834951 | 0.835399 | 0.150775 | 0.001675 | {} |
| 7 | K-Nearest Neighbors | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.870550 | 0.880995 | 0.870550 | 0.836126 | 0.635684 | 0.003601 | {'n_neighbors': 3, 'p': 2, 'weights': 'distance'} |
print("TF-IDF Results (Hypertuning & PCA)")
display(tfidf_results_ht_pca)
TF-IDF Results (Hypertuning & PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.997573 | 0.997579 | 0.997573 | 0.997571 | 0.977346 | 0.978440 | 0.977346 | 0.976967 | 143.833270 | 0.000498 | {'C': 0.01, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'} |
| 1 | Support Vector Machine | 0.996764 | 0.996777 | 0.996764 | 0.996762 | 0.993528 | 0.993700 | 0.993528 | 0.993541 | 3.837126 | 0.079515 | {'C': 10, 'class_weight': 'balanced', 'gamma': 'scale', 'kernel': 'rbf'} |
| 2 | Decision Tree | 0.984628 | 0.984850 | 0.984628 | 0.984633 | 0.886731 | 0.890585 | 0.886731 | 0.887953 | 22.046987 | 0.000443 | {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 5} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.983819 | 0.984205 | 0.983819 | 0.983653 | 134.131141 | 0.011322 | {'criterion': 'entropy', 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100} |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.983819 | 0.983799 | 0.983819 | 0.983757 | 471.360891 | 0.005865 | {'learning_rate': 0.2, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.977346 | 0.977508 | 0.977346 | 0.977212 | 30.384503 | 0.001787 | {'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.9} |
| 6 | Naive Bayes | 0.789644 | 0.819981 | 0.789644 | 0.779825 | 0.786408 | 0.810530 | 0.786408 | 0.784380 | 0.142238 | 0.002696 | {} |
| 7 | K-Nearest Neighbors | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.877023 | 0.917915 | 0.877023 | 0.850527 | 0.923243 | 0.004865 | {'n_neighbors': 3, 'p': 2, 'weights': 'distance'} |
print("Word2Vec Results (Hypertuning & PCA)")
display(word2vec_results_ht_pca)
Word2Vec Results (Hypertuning & PCA)
| | Classifier | Train Accuracy | Train Precision | Train Recall | Train F1-score | Test Accuracy | Test Precision | Test Recall | Test F1-score | Training Time | Prediction Time | Best Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.996764 | 0.996771 | 0.996764 | 0.996761 | 0.941748 | 0.943847 | 0.941748 | 0.942467 | 69.178033 | 0.000391 | {'C': 0.1, 'max_iter': 500, 'penalty': 'l2', 'solver': 'saga'} |
| 1 | Support Vector Machine | 0.997573 | 0.997589 | 0.997573 | 0.997573 | 0.967638 | 0.970994 | 0.967638 | 0.968170 | 2.009258 | 0.044738 | {'C': 10, 'class_weight': 'balanced', 'gamma': 'scale', 'kernel': 'rbf'} |
| 2 | Decision Tree | 0.986246 | 0.986295 | 0.986246 | 0.986249 | 0.773463 | 0.784403 | 0.773463 | 0.775144 | 12.682643 | 0.000358 | {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5} |
| 3 | Random Forest | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.967638 | 0.971403 | 0.967638 | 0.968343 | 105.257361 | 0.022254 | {'criterion': 'entropy', 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200} |
| 4 | Gradient Boosting | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.970874 | 0.972674 | 0.970874 | 0.971157 | 250.149165 | 0.006686 | {'learning_rate': 0.2, 'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200} |
| 5 | XG Boost | 0.999191 | 0.999194 | 0.999191 | 0.999191 | 0.967638 | 0.970347 | 0.967638 | 0.967919 | 14.594018 | 0.001569 | {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.9} |
| 6 | Naive Bayes | 0.907767 | 0.911257 | 0.907767 | 0.907027 | 0.844660 | 0.856784 | 0.844660 | 0.844113 | 0.140277 | 0.001625 | {} |
| 7 | K-Nearest Neighbors | 0.907767 | 0.916753 | 0.907767 | 0.898325 | 0.880259 | 0.894452 | 0.880259 | 0.853160 | 0.568239 | 0.003531 | {'n_neighbors': 3, 'p': 2, 'weights': 'uniform'} |
GloVe Embedding with Hypertuning & PCA:
TFIDF Features with Hypertuning & PCA:
Word2Vec Embedding with Hypertuning & PCA:
Insights and Comparison:
This comparison underscores the value of combining hyperparameter tuning with dimensionality reduction such as PCA: performance and generalization improve further, particularly for complex models and for high-dimensional feature representations like TF-IDF.
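The same quick comparison for the PCA pipeline, with the best test F1-scores and winning classifiers copied from the three tables above, shows TF-IDF + SVM clearly on top (an illustrative sketch, separate from the evaluation code):

```python
import pandas as pd

# Best hyper-tuned + PCA test F1-scores, copied from the result tables above
summary_pca = pd.DataFrame({
    "Embedding": ["GloVe", "TF-IDF", "Word2Vec"],
    "Best Classifier": ["XG Boost", "Support Vector Machine", "Gradient Boosting"],
    "Test F1-score": [0.971284, 0.993541, 0.971157],
})
best = summary_pca.loc[summary_pca["Test F1-score"].idxmax()]
print(f"{best['Best Classifier']} on {best['Embedding']} "
      f"(F1 = {best['Test F1-score']:.4f})")
```

Unlike the no-PCA comparison, the gap here is roughly two F1 points, so the TF-IDF + PCA + SVM combination is the strongest candidate for the chatbot's classifier.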
Conclusion:
# Function to plot classification reports for all the ML classifiers (with Hypertuning and PCA) and training/prediction times
def plot_results(df, title):
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Classification report heatmap
report_data = df[['Classifier', 'Train Precision', 'Train Recall', 'Train F1-score',
'Test Precision', 'Test Recall', 'Test F1-score']].set_index('Classifier')
sns.heatmap(report_data, annot=True, cmap='Greens', fmt='.2f', ax=ax1)
ax1.set_title(f'Classifier Performance - {title}')
# Training and prediction time comparison
df.plot(x='Classifier', y=['Training Time', 'Prediction Time'], kind='bar', ax=ax2, cmap='Set3')
ax2.set_title(f'Training and Prediction Time - {title}')
ax2.set_ylabel('Time (seconds)')
plt.tight_layout()
plt.show()
# Plot results for each DataFrame (with hyperparameter tuning and PCA)
plot_results(glove_results_ht_pca, 'Glove Embeddings (Hyperparameter Tuning with PCA)')
plot_results(tfidf_results_ht_pca, 'TF-IDF Embeddings (Hyperparameter Tuning with PCA)')
plot_results(word2vec_results_ht_pca, 'Word2Vec Embeddings (Hyperparameter Tuning with PCA)')
# Function to plot confusion matrices for all classifiers with word embeddings generated using Glove, TF-IDF and Word2Vec, along with Hypertuning and PCA
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
def plot_confusion_matrices_ht_pca(X_train, X_test, y_train, y_test, df_name):
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
fig.suptitle(f'Confusion Matrices for {df_name} (with PCA and Hyperparameter Tuning)', fontsize=16)
for i, (name, (clf, _)) in enumerate(classifiers.items()):
row = i // 4
col = i % 4
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(ax=axes[row, col], cmap='Greens')
axes[row, col].set_title(name)
plt.tight_layout()
plt.show()
plot_confusion_matrices_ht_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_confusion_matrices_ht_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_confusion_matrices_ht_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
Overall Performance:
Glove Embeddings Base Classifiers + Hypertuning + PCA:
TF-IDF Features Base Classifiers + Hypertuning + PCA:
Word2Vec Embeddings Base Classifiers + Hypertuning + PCA:
Class-specific observations:
Model Complexity:
Embedding Effectiveness:
Comparison with Previous Results:
Train vs Test Confusion Matrices for all ML classifiers with Hypertuning & PCA
def plot_train_test_confusion_matrices_ht_pca(X_train, X_test, y_train, y_test, df_name):
    fig, axes = plt.subplots(8, 2, figsize=(20, 40))
    fig.suptitle(f'Train and Test Confusion Matrices for {df_name} (Hyperparameter Tuning with PCA)', fontsize=15, y=0.98)
    for i, (name, (clf, _)) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)

        # Train confusion matrix
        y_train_pred = clf.predict(X_train)
        cm_train = confusion_matrix(y_train, y_train_pred)
        disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=clf.classes_)
        disp_train.plot(ax=axes[i, 0], cmap='Greens')
        axes[i, 0].set_title(f'{name} (Train)', fontsize=12)

        # Test confusion matrix
        y_test_pred = clf.predict(X_test)
        cm_test = confusion_matrix(y_test, y_test_pred)
        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=clf.classes_)
        disp_test.plot(ax=axes[i, 1], cmap='Greens')
        axes[i, 1].set_title(f'{name} (Test)', fontsize=12)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
plot_train_test_confusion_matrices_ht_pca(X_train_glove, X_test_glove, y_train_glove, y_test_glove, 'Glove Embeddings')
plot_train_test_confusion_matrices_ht_pca(X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf, 'TF-IDF Features')
plot_train_test_confusion_matrices_ht_pca(X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec, 'Word2Vec Embeddings')
Overall Performance Improvement:
Consistent Top Performers:
Feature Set Comparison:
Impact of PCA:
Hypertuning Benefits:
Trade-offs:
Prioritize Ensemble Methods: Focus on Random Forest, Gradient Boosting, and XGBoost as your primary models, as they consistently deliver top performance.
Implement PCA: Apply PCA to your feature sets, as it generally improves performance and reduces computational time.
Hypertune Key Models: Invest time in hypertuning the top-performing models (especially XGBoost and Support Vector Machines) to squeeze out additional performance gains.
Prefer Glove Embeddings: Use Glove embeddings as your primary feature set, with TF-IDF as a strong alternative.
Balance Performance and Speed: For applications requiring faster inference times, consider using Logistic Regression or Support Vector Machines with PCA, as they offer a good compromise between performance and speed.
Ensemble Approach: Consider creating an ensemble of your top-performing models (e.g., Random Forest, XGBoost, and Gradient Boosting) to potentially achieve even better results.
Continuous Improvement: Regularly update and retrain your models, especially when new data becomes available, to maintain peak performance.
Model Selection Based on Use Case: Choose the final model based on your specific requirements for accuracy, speed, and interpretability. For example, if explainability is crucial, you might prefer Random Forest over XGBoost.
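The "Implement PCA" and "Ensemble Approach" recommendations above can be combined in a single scikit-learn pipeline. The sketch below is a minimal, self-contained illustration using synthetic features in place of the actual embedding matrices, and a soft-voting ensemble of Random Forest and Gradient Boosting (XGBoost could be added as a third estimator if installed); the parameter values are illustrative assumptions, not the tuned settings from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for an embedding feature matrix (e.g. Glove vectors)
X, y = make_classification(n_samples=400, n_features=100, n_informative=20,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# PCA keeps components explaining 95% of the variance, then a soft-voting
# ensemble averages the class probabilities of the two tree ensembles.
ensemble = make_pipeline(
    PCA(n_components=0.95, random_state=42),
    VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
            ('gb', GradientBoostingClassifier(random_state=42)),
        ],
        voting='soft',
    ),
)
ensemble.fit(X_tr, y_tr)
preds = ensemble.predict(X_te)
print('weighted F1:', round(f1_score(y_te, preds, average='weighted'), 3))
```

Soft voting is chosen here because both base models expose `predict_proba`; with hard voting the ensemble would only count majority votes and lose the models' confidence information.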